Data-snooping bias - Wikipedia, the free encyclopedia

Data-snooping bias

From Wikipedia, the free encyclopedia

In statistics, data-snooping bias is a form of statistical bias generated by the misuse of data mining techniques, which can lead to spurious results in scientific research. Although data-snooping bias can occur in any field that uses data mining, it is a particular concern in finance and medical research, both of which make heavy use of data mining techniques.

In the process of data mining, huge numbers of hypotheses about a single data set can be tested in a very short time, by exhaustively searching for combinations of variables that might show a correlation.
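To give a sense of how quickly the number of candidate hypotheses grows, the following minimal sketch counts the variable combinations an exhaustive search would consider. The function name and the figure of 50 variables are illustrative assumptions, not taken from any particular study.

```python
import math

def count_candidate_hypotheses(num_variables, max_group_size):
    """Count the distinct groups of 2..max_group_size variables.

    Each group stands for one candidate hypothesis of the form
    "these variables are related". Illustrative assumption only.
    """
    return sum(math.comb(num_variables, k) for k in range(2, max_group_size + 1))

# With a hypothetical 50 variables:
pairs = count_candidate_hypotheses(50, 2)             # 1225 pairwise hypotheses
pairs_and_triples = count_candidate_hypotheses(50, 3)  # 1225 + 19600 = 20825
print(pairs, pairs_and_triples)
```

Even a modest data set therefore offers thousands of hypotheses to test before any subgroups or transformations of the variables are considered.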

Because conventional tests of statistical significance are based on the probability that a result at least as extreme would arise by chance alone, it is reasonable to expect that, where no real effect exists, about 5% of the hypotheses tested will turn out to be significant at the 5% level, 0.1% will turn out to be significant at the 0.1% significance level, and so on, simply by chance.
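The expectation described above can be checked with a small simulation. This is a sketch under stated assumptions: every "hypothesis" is tested on pure noise (standard normal draws with known unit variance) using a two-sided z-test. The function names, the sample size of 30, and the seed are all hypothetical choices made for illustration.

```python
import math
import random

def null_p_value(rng, n=30):
    """Two-sided z-test p-value for the mean of n pure-noise N(0, 1) draws."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # With known unit variance, mean * sqrt(n) equals sum / sqrt(n).
    z = sum(sample) / math.sqrt(n)
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2.0))

def fraction_significant(num_hypotheses, alpha, seed=0):
    """Share of noise-only hypotheses that pass the significance threshold."""
    rng = random.Random(seed)
    hits = sum(null_p_value(rng) < alpha for _ in range(num_hypotheses))
    return hits / num_hypotheses

if __name__ == "__main__":
    for alpha in (0.05, 0.001):
        print(alpha, fraction_significant(20000, alpha))
```

Because the data contain no real effect, every "significant" result counted here is a false positive, and the observed fraction should sit close to the chosen significance level.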

Thus, given enough hypotheses tested, it is virtually certain that some of them will appear to be highly statistically significant, even on a data set with no real correlations at all. Researchers who are using data mining techniques can be easily misled by these apparently significant results, even though they are merely chance artifacts.
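The "virtually certain" claim can be made concrete with a short calculation. The sketch below assumes the tests are independent and that no real effect exists, which is a simplification; the function name is an illustrative choice.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests on
    noise-only data is significant at level alpha: 1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** num_tests

if __name__ == "__main__":
    for m in (1, 14, 100, 1000):
        print(m, round(prob_at_least_one_false_positive(m), 4))
```

Under these assumptions the probability passes one half at 14 tests and exceeds 0.99 at 100 tests, so a data mining run that tries thousands of hypotheses should be expected to produce apparently significant results from noise alone.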

One way to think about data snooping is as the attitude toward data analysis of "I don't care what my hypothesis turns out to be." Examining the data is thus reduced to the problem of formulating a class of hypotheses broad enough that one of them is bound to be true for that data. In cases where the data set cannot be checked against a separate collection, this practice makes it difficult to realize that the hypothesis so produced is spurious.

For example, in a list of 367 people, at least two are guaranteed to share a birthday; let's say they are Mary Jane and John Smith, both born on August 7th. A data-snooping hypothesis would seek to find something special about the two: perhaps they are the youngest and the oldest; perhaps they are the only two who have met exactly once before, or exactly twice, or exactly three times; perhaps they are the only two whose fathers share a first name, or whose mothers do; and so on. By going through hundreds, or perhaps thousands, of potential, very interesting hypotheses that each have a low probability of being true, we can find one that is.

Let's say that for this group it turns out that John and Mary are the only two who switched minors three times in college, a fact we found out by exhaustively comparing their life histories. Our hypothesis can then become "Being born on August 7th results in a much higher than average chance of switching minors more than twice in college"! Indeed, turning to the data, we cannot help but see that it very strongly supports that correlation, since not one of the other people (with a different birthday) switched minors three times in college, whereas both of the people with an August 7th birthday did. Turning to the general population, we attempt to reproduce the result by selecting for August 7th birthdates, and find that no such correlation holds. Why? Because in this example we have become victims of the data snooper, who simply chose whatever obscure fact happened to be true for that particular data set.
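The failure to replicate described above can be imitated with a loosely analogous simulation. This is not the birthday example itself: as a simplifying assumption, an outcome and 1000 unrelated "traits" are all drawn as independent noise for 50 people, the trait that happens to correlate best is selected, and that trait is then measured again on a much larger fresh sample. All names and sizes are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def snoop_then_replicate(num_people=50, num_traits=1000,
                         fresh_people=5000, seed=0):
    """Return (snooped |r| in the original sample, |r| in a fresh sample)."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(num_people)]
    traits = [[rng.gauss(0, 1) for _ in range(num_people)]
              for _ in range(num_traits)]
    # Snooping step: keep whichever trait happens to correlate best.
    scores = [abs(pearson(t, outcome)) for t in traits]
    best = max(range(num_traits), key=scores.__getitem__)
    # Replication step: every trait is pure noise, so re-measuring the
    # selected trait on new people is simply a fresh draw of noise.
    fresh_outcome = [rng.gauss(0, 1) for _ in range(fresh_people)]
    fresh_trait = [rng.gauss(0, 1) for _ in range(fresh_people)]
    return scores[best], abs(pearson(fresh_trait, fresh_outcome))

if __name__ == "__main__":
    in_sample, out_of_sample = snoop_then_replicate()
    print("snooped correlation:", round(in_sample, 3))
    print("fresh-sample correlation:", round(out_of_sample, 3))
```

The selected trait looks strongly related to the outcome only because it was chosen after looking at the data; on the fresh sample the correlation falls back toward zero.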
