datamining   12546


Library Juice » Data Mining
Austin et al. point out that the statistical methods at the heart of data mining cannot distinguish real from spurious associations. Data mining employs the automated examination of enormous bodies of data. Its usefulness is thought to be proportional to the size of the data set that it collates; however, as the data set becomes larger and as the number of attributes that serve as potential relata increases, the number of potential relationships increases exponentially. Importantly, the number of spurious associations also increases. With enough data, no significance test will be stringent enough to provide assurance against the kind of results found in Austin et al. What is needed, according to Austin et al., is a “pre-specified plausible hypothesis.” For statistical analysis to be valuable, the researcher must begin with a hypothesis, preferably a plausible one.
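
A small simulation can make the point about spurious associations concrete. The sketch below is not from Austin et al.; it assumes purely random, independent attributes (so every association it finds is spurious by construction), looks only at pairwise Pearson correlations, and uses an approximate 5% cutoff derived from the Fisher z-transform. The names `pearson` and `count_spurious` are illustrative.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_attributes, n_rows=50, seed=0):
    """Count attribute pairs that look 'significant' in pure noise.

    Returns (number of pairs tested, number passing the cutoff).
    """
    rng = random.Random(seed)
    # Every column is independent Gaussian noise: no real relationships exist.
    columns = [[rng.gauss(0, 1) for _ in range(n_rows)]
               for _ in range(n_attributes)]
    # Approximate two-sided 5% cutoff for |r| via the Fisher z-transform.
    threshold = math.tanh(1.96 / math.sqrt(n_rows - 3))
    pairs = list(itertools.combinations(columns, 2))
    hits = sum(1 for a, b in pairs if abs(pearson(a, b)) > threshold)
    return len(pairs), hits


for k in (5, 20, 80):
    n_pairs, hits = count_spurious(k)
    print(f"{k:3d} attributes: {n_pairs:5d} pairs, {hits:4d} pass the 5% test")
```

Because the columns are independent noise, roughly one pair in twenty should pass a 5% test by chance, so the number of false hits grows with the number of pairs tested even though there is nothing real to find.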

What exactly is a pre-specified plausible hypothesis and how can we generate it if data mining can’t do that for us? The question was posed some sixty years ago by the philosopher Nelson Goodman using different terms: Goodman believed that a critical question for epistemology was to distinguish between “projectible and non-projectible hypotheses.” One can more or less replace “pre-specified plausible hypothesis” with Goodman’s term “projectible hypothesis.” According to Goodman, when we seek to understand which hypotheses are (or are not) projectible, we do not come to the problem “empty-headed but with some stock of knowledge” which we use to determine what is (or is not) projectible. Projectible hypotheses will be those which do not conflict with other hypotheses that have been supported in the past. They will commonly use the same terminology as previously supported hypotheses. The terminology appearing in the hypotheses will have become “entrenched” in the language. This goes a long way toward explaining why we don’t find the link between one’s astrological sign and medical conditions plausible. Twenty-first century Western medicine is not accustomed to linking astrological signs to ailments and so must find any hypothesis that does so implausible.

If Goodman is correct, then data mining is of little use without an historical understanding of the field of science to which the data pertains.
datamining  statistics  knowledge  digitalhumanities 
12 hours ago by rybesh
What is Apache Hadoop?
O'Reilly examines the components of the Hadoop ecosystem.
apache  hadoop  data  datamining 
19 hours ago by garrettc
tf–idf - Wikipedia
term frequency–inverse document frequency: how to find relevant terms from a set of overlapping terms
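
A minimal sketch of the idea, assuming the basic unsmoothed weighting (term frequency as a share of the document's tokens, inverse document frequency as the log of total documents over documents containing the term). Libraries often use smoothed or normalized variants, so the numbers here will not match theirs. The function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    "data mining finds patterns in data",
    "text mining finds terms in text",
    "hadoop stores data",
]
for doc, weights in zip(docs, tf_idf(docs)):
    top = max(weights, key=weights.get)
    print(f"{doc!r}: top term = {top}")
```

A term that occurs in every document gets an idf of log(1) = 0, which is how the weighting pushes down terms that overlap across the whole set.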
algorithm  data  datamining  nlp  wikipedia  geo  wp 
yesterday by torsten
the Art of R Programming [guest post] « Xi'an's Og
The Art of R Programming: lots of gems, including parallel R. A Valentine gift for data scientists.
datamining  from twitter_favs
3 days ago by leecarrot
Smart Content Re-viewed: Text Analytics and Semantic Content Enrichment
"There are other solution providers in the content analytics meets semantic annotation/enrichment game. In addition to IBM and Ontotext, they include HP Autonomy, MarkLogic, OpenText, Temis, and the nascent, open-source IKS project. Other vendors offer enterprise-strength building blocks, for instance, SAS via the various SAS Text Analytics components."
text-analytics  NLP  datamining  visualization  content-analytics  content-enrichment  semantic-content-enrichment  linkeddata  ontologies 
5 days ago by jschneider
Take Advantage of Twitter Search Operators — Online Collaboration
The positive and negative attitude filters simply find people using the smiling and frowning emoticons; however, it does seem to include several varieties of each emoticon. I would use it as a quick way to find positive or negative mentions, but I wouldn’t use it for any kind of measurement. For example:
twitter  datamining 
6 days ago by jmck
