rybesh + statistics 94
Cube
5 weeks ago by rybesh
Cube is a system for collecting timestamped events and deriving metrics. By collecting events rather than metrics, Cube lets you compute aggregate statistics post hoc. It also enables richer analysis, such as quantiles and histograms of arbitrary event sets.
realtime
statistics
The RDF Data Cube Vocabulary
8 weeks ago by rybesh
There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations. The Data Cube vocabulary is a core foundation which supports extension vocabularies to enable publication of other aspects of statistical data flows.
metadata
standard
data
description
inls520
webinfo
statistics
science
Beeminder
9 weeks ago by rybesh
Anything you can put a periodic number on works -- weight, pushups, number of cigarettes, or how long it takes you to bike to work. Just answer with your number when Beeminder asks and it will show you your progress and a yellow brick road to follow to stay on track.
If you go off track, you pledge money to stay on the road the next time. If you go off track again, we charge you.
productivity
statistics
Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition.
12 weeks ago by rybesh
During the past decade there has been an explosion in computation and information technology. With it has come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting--the first comprehensive treatment of this topic in any book.
This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization and spectral clustering. There is also a chapter on methods for “wide” data (p bigger than n), including multiple testing and false discovery rates.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie wrote much of the statistical modeling software in S-PLUS and invented principal curves and surfaces. Tibshirani proposed the Lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, and projection pursuit.
statistics
machinelearning
datamining
Hierarchical modeling and analysis for spatial data - Sudipto Banerjee, Bradley P. Carlin, Alan E. Gelfand - Google Books
february 2012 by rybesh
Among the many uses of hierarchical modeling, their application to the statistical analysis of spatial and spatio-temporal data from areas such as epidemiology and environmental science has proven particularly fruitful. Yet to date, the few books that address the subject have been either too narrowly focused on specific aspects of spatial analysis, or written at a level often inaccessible to those lacking a strong background in mathematical statistics. Hierarchical Modeling and Analysis for Spatial Data is the first accessible, self-contained treatment of hierarchical methods, modeling, and data analysis for spatial and spatio-temporal data. Starting with overviews of the types of spatial data, the data analysis tools appropriate for each, and a brief review of the Bayesian approach to statistics, the authors discuss hierarchical modeling for univariate spatial response data, including Bayesian kriging and lattice (areal data) modeling. They then consider the problem of spatially misaligned data, methods for handling multivariate spatial responses, spatio-temporal models, and spatial survival models. The final chapter explores a variety of special topics, including spatially varying coefficient models.
bayes
space
temporality
modeling
statistics
Library Juice » Data Mining
february 2012 by rybesh
Austin et al. point out that the statistical methods that are at the heart of data mining are not able to distinguish real from spurious associations. Data mining employs the automated examination of enormous bodies of data. Its usefulness is thought to be proportional to the size of the data set that it collates; however, as the data set becomes larger and as the number of attributes that serve as potential relata increases, the number of potential relationships increases exponentially. Importantly, the number of spurious associations also increases. With enough data, no significance test will be stringent enough to provide assurance against the kind of results found in Austin et al. What is needed, according to Austin et al., is a “pre-specified plausible hypothesis.” For statistical analysis to be useful, the researcher must begin with a hypothesis, preferably a plausible one, if the research is to be valuable.
What exactly is a pre-specified plausible hypothesis and how can we generate it if data mining can’t do that for us? The question was posed some sixty years ago by the philosopher Nelson Goodman using different terms: Goodman believed that a critical question for epistemology was to distinguish between “projectible and non-projectible hypotheses.” One can more or less replace “pre-specified plausible hypothesis” with Goodman’s term “projectible hypothesis.” According to Goodman, when we seek to understand what hypothesis is (or is not) projectible, we do not come to the problem “empty-headed but with some stock of knowledge” which we use to determine what is (or is not) projectible. Projectible hypotheses will be those which do not conflict with other hypotheses that have been supported in the past. They will commonly use the same terminology of previously supported hypotheses. The terminology appearing in the hypotheses will have become “entrenched” in the language. This goes a long distance toward explaining why we don’t find the link between one’s astrological sign and medical conditions plausible. Twenty-first century Western medicine is not accustomed to linking astrological signs to ailments and so must find any hypothesis that does so implausible.
If Goodman is correct, then data mining is of little use without an historical understanding of the field of science to which the data pertains.
...
Here, we have another argument for allocating library resources to pay for librarians with deep subject expertise. As e-science develops, vendors will make more and more data sets available, regardless of their actual worth to researchers. To effectively choose the data sets that are of value, librarians must have a thorough understanding of the research needs of their patrons. To do this, they must have a deep understanding of the field. Unfortunately, with the excitement swirling around e-science, the mere access to large data sets threatens to become the be-all and end-all in collection management. If we aren’t careful, we may find ourselves with mountains of data from which everything and nothing can be concluded.
datamining
statistics
knowledge
digitalhumanities
libraries
epistemology
Statistics 110: Introduction to Probability
january 2012 by rybesh
Statistics 110 (Introduction to Probability), taught at Harvard University by Joe Blitzstein in Fall 2011. Lecture videos, homework, review material, practice exams, and a large collection of practice problems with detailed solutions are provided. This course is an introduction to probability as a language and set of tools for understanding statistics, science, risk, and randomness. The ideas and methods are useful in statistics, science, philosophy, engineering, economics, finance, and everyday life. Topics include the following. Basics: sample spaces and events, conditional probability, Bayes’ Theorem. Random variables and their distributions: cumulative distribution functions, moment generating functions, expectation, variance, covariance, correlation, conditional expectation. Univariate distributions: Normal, t, Binomial, Negative Binomial, Poisson, Beta, Gamma. Multivariate distributions: joint, conditional, and marginal distributions, independence, transformations, Multinomial, Multivariate Normal. Limit theorems: law of large numbers, central limit theorem. Markov chains: transition probabilities, stationary distributions, reversibility, convergence.
statistics
education
Detecting Novel Associations in Large Data Sets
december 2011 by rybesh
Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs—far too many to examine manually. If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones?
statistics
relationships
datamining
The Effects of Racial Animus on Voting: Evidence Using Google Search Data
november 2011 by rybesh
Traditional surveys struggle to capture socially unacceptable attitudes such as racial animus. This paper uses Google searches including racially charged language as a proxy for a local area’s racial animus. I use the Google-search proxy, available for roughly 200 media markets in the United States, to reassess the impact of racial attitudes on voting for a black candidate in the United States. I compare an area’s racially charged search volume to its votes for Barack Obama, the 2008 black Democratic presidential candidate, controlling for its votes for John Kerry, the 2004 white Democratic presidential candidate. Other studies using a similar empirical specification and standard state-level survey measures of racial attitudes yield little evidence that racial animus had a major impact in recent U.S. elections. Using the Google-search proxy, I find significant and robust effects in the 2008 presidential election. The estimates imply that racial animus in the United States cost Obama three to five percentage points in the national popular vote in the 2008 election.
statistics
socialscience
methods
search
Bayesian statistics - Scholarpedia
september 2011 by rybesh
Bayesian statistics is a system for describing epistemological uncertainty using the mathematical language of probability. In the 'Bayesian paradigm,' degrees of belief in states of nature are specified; these are non-negative, and the total belief in all states of nature is fixed to be one. Bayesian statistical methods start with existing 'prior' beliefs, and update these using data to give 'posterior' beliefs, which may be used as the basis for inferential decisions.
bayes
statistics
pandas: a python data analysis library — pandas v0.4.0dev documentation
august 2011 by rybesh
pandas is a python package providing convenient data structures for time series, cross-sectional, or any other form of “labeled” data, with tools for building statistical and econometric models.
python
statistics
dataprocessing
analysis
ScalaNLP
august 2011 by rybesh
ScalaNLP is a collection of libraries for Natural Language Processing, Machine Learning, and Statistics.
scala
nlp
linearalgebra
statistics
MADlib
july 2011 by rybesh
MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.
database
analytics
datamining
statistics
machinelearning
sql
Home page for the book, "Bayesian Data Analysis"
june 2011 by rybesh
This book is intended to have three roles and to serve three associated audiences: an introductory text on Bayesian inference starting from first principles, a graduate text on effective current approaches to Bayesian modeling and computation in statistics and related fields, and a handbook of Bayesian methods in applied statistics for general users of and researchers in applied statistics.
bayes
statistics
data
analysis
Christopher M. Bishop: Pattern Recognition and Machine Learning
june 2011 by rybesh
This leading textbook provides a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first-year PhD students, as well as researchers and practitioners. No previous knowledge of pattern recognition or machine learning concepts is assumed. This is the first machine learning textbook to include a comprehensive coverage of recent developments such as probabilistic graphical models and deterministic inference methods, and to emphasize a modern Bayesian perspective. It is suitable for courses on machine learning, statistics, computer science, signal processing, computer vision, data mining, and bioinformatics. This hard cover book has 738 pages in full colour, and there are 431 graded exercises (with solutions available below). Extensive support is provided for course instructors.
machinelearning
books
patterns
statistics
datamining
Google Books: American English (155 billion words)
may 2011 by rybesh
This interface allows you to search the Google Books data in many ways that are much more advanced than what is possible with the simple Google Books interface. You can search by word, phrase, substring, lemma, part of speech, synonyms, and collocates (nearby words). You can copy the data to other applications for further analysis, which you can't do with the regular Google Books interface. And you can quickly and easily compare the data in two different sections of the corpus (for example, adjectives describing women or art or music in the 1960s-2000s vs the 1870s-1910s).
american
books
corpus
data
statistics
language
CRAN - Package SPARQL
may 2011 by rybesh
Load SPARQL result table from an end-point as a data.frame
sparql
R
tools
statistics
visualization
RDF
Using Graphs Instead of Tables
march 2011 by rybesh
The extra work required in producing graphs is rewarded by greatly enhanced presentation and communication of empirical results.
charts
graphics
statistics
visualization
RStudio
march 2011 by rybesh
RStudio™ is a new integrated development environment (IDE) for R. RStudio combines an intuitive user interface with powerful coding tools to help you get the most out of R.
statistics
tools
Deducer - A graphical data analysis system for use with JGR - RForge.net
february 2011 by rybesh
An intuitive, cross-platform graphical data analysis system. It uses menus and dialogs to guide the user efficiently through the data manipulation and analysis process, and has an Excel-like spreadsheet for easy data frame visualization and editing.
R
statistics
tools
Daisy Zhe Wang: BayesStore
january 2011 by rybesh
BayesStore is a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. BayesStore represents model and evidence data as relational tables; implements inference algorithms efficiently in SQL; adds probabilistic relational operators to the query engine; optimizes queries with both relational and inference operators. The design goals of BayesStore are: (1) to be able to support efficient query processing over different models compared to the off-the-shelf machine learning libraries; (2) to be able to support extensible API for plugging in new models and inference algorithms; and (3) to be able to scale up to very large data sets.
statistics
bayes
database
machinelearning
Tahir Hemphill
december 2010 by rybesh
The Hip-Hop Word Count is a searchable ethnographic database built from the lyrics of over 40,000 Hip-Hop songs from 1979 to present day.
hiphop
digitalhumanities
statistics
ARCADE: Literature, the Humanities, and the World
december 2010 by rybesh
...digital media and huge databases have enormous potential for supporting, preserving, and making available for study the kinds of underground knowledges and cultural productions outside the sphere of mainstream print that you're concerned about. This is the insurgent potential of the Internet and digital media--they can bypass established methods of fixation and legitimation of cultural products. But in academia these are subjects of interest to humanists--and sociologists and anthropologists. By contrast, when true disciplinary outsiders like Jean-Baptiste Michel and his team enter the arena of cultural history and cultural studies from the side of science and engineering, they must be looking to legitimate themselves by proving that their approach "works" for subjects that they imagine will be widely recognized as significant.
digitalhumanities
nlp
statistics
critique
edwired » Blog Archive » Visualizing Millions of Words
december 2010 by rybesh
...the lesson that I would then focus on with my students is that what they are looking at in such a graph is nothing more or less than the frequency with which a word is used in books (and only books) published over the centuries. While such frequencies do reflect something, it is not clear from one graph just what that something is. So instead of an answer, a graph like this one is a doorway that leads to a room filled with questions, each of which must be answered by the historian before he or she knows something worth knowing.
digitalhumanities
nlp
statistics
Works Cited: Google Books Ngrams and the number of words for "snow"
december 2010 by rybesh
There's a certain Words For Snowism in the online Google Books Ngrams tool, the suggestion that the more frequently a word is used, the more important it is in a collective unconscious of which the Google Books data set serves as a convenient index. This importance is not the same thing as significance, in the sense of significant digits or statistical significance; it's not the difference that makes a difference, but rather a psychologized importance--attachment, cathexis. Which is really kind of garbage.
nlp
digitalhumanities
statistics
critique
tm - Text Mining Package
october 2010 by rybesh
tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.
The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to ease the use of large, metadata-enriched document sets.
R
textmining
datamining
nlp
tools
statistics
Piwik - Web analytics - Open source
october 2010 by rybesh
Piwik is a downloadable, open source (GPL licensed) real time web analytics software program. It provides you with detailed reports on your website visitors: the search engines and keywords they used, the language they speak, your popular pages… and so much more.
Piwik aims to be an open source alternative to Google Analytics.
analytics
web
opensource
statistics
Journal of Statistical Software — Show
august 2010 by rybesh
This user guide describes a Python package, PyMC, that allows users to efficiently code a probabilistic model and draw samples from its posterior distribution using Markov chain Monte Carlo techniques.
statistics
tools
python
Training Examples Q&A - machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
june 2010 by rybesh
Where data geeks ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization!
ai
machinelearning
nlp
textanalysis
ir
datamining
search
statistics
infoviz
reference
Signs of Neanderthals Mating With Humans - NYTimes.com
may 2010 by rybesh
"...the statistical insights, however informative, do not have the solidity of an archaeological fact."
epistemology
statistics
facts
history
archaeology
ScapeToad - cartogram software by the Choros laboratory
february 2010 by rybesh
ScapeToad uses the Gastner/Newman [2004] diffusion-based algorithm to adapt map surfaces to user-defined variables without altering their topological relations.
cartography
maps
statistics
tools
UMBEL Ontology Documentation - umbel:withAlignment
august 2009 by rybesh
umbel:withAlignment is used to reify a umbel:isAligned or a umbel:linksConcept property to a calculated or estimated overlap percentage value between the two classes (sets).
semweb
ontology
vocabulary
statistics
Maximum Entropy (GA) Model Optimization Package
august 2009 by rybesh
Maximum entropy (aka logistic regression) models are very popular, especially in natural language processing. The software here is an implementation of maximum likelihood and maximum a posteriori optimization of the parameters of these models. The algorithms used are much more efficient than the iterative scaling techniques used in almost every other maxent package out there.
research
tools
nlp
statistics
machinelearning
ocaml
logreg
maxent
“seeing” the Web and a Karl Pearson citation
july 2009 by rybesh
Over the last couple of years, the social sciences have been increasingly interested in using computer-based tools to analyze the complexity of the social ant farm that is the Web. Issuecrawler was one of the first of such tools and today researchers are indeed using very sophisticated pieces of software to “see” the Web. Sciences-Po, one of these rather strange French institutions that were founded to educate the elite but which now have to increasingly justify their existence by producing research, has recently hired Bruno Latour to head their new médialab, which will most probably head into that very direction. Given Latour’s background (and the fact that Paul Girard, a very competent former colleague at my lab, heads the R&D department), this should be really very interesting. I do hope that there will be occasion to tackle the most compelling methodological question when it comes to the application of computers (or mathematics in general) to analyzing human life, which is beautifully framed in a rather reluctant statement from 1889 by Karl Pearson, a major figure in the history of statistics:
“Personally I ought to say that there is, in my own opinion, considerable danger in applying the methods of exact science to problems in descriptive science, whether they be problems of heredity or of political economy; the grace and logical accuracy of the mathematical processes are apt to so fascinate the descriptive scientist that he seeks for sociological hypotheses which fit his mathematical reasoning and this without first ascertaining whether the basis of his hypotheses is as broad as that human life to which the theory is to be applied.” cit. in. Stigler, Stephen M.: The History of Statistics. Harvard University Press, 1990 p. 304
actor-network_theory
epistemolgy
network_theory
statistics
from google
Evidence Based Scheduling - Joel on Software
july 2009 by rybesh
You gather evidence, mostly from historical timesheet data, that you feed back into your schedules. What you get is not just one ship date: you get a confidence distribution curve, showing the probability that you will ship on any given date.
statistics
business
planning
development
software
management
How to Write a Spelling Corrector
april 2008 by rybesh
A toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second.
python
nlp
howto
statistics
UNdata
march 2008 by rybesh
An easy to use data access system was developed that meets UNSD’s vision of providing an integrated information resource with current, relevant and reliable statistics free of charge to the global community.
statistics
database
opendata
demographics
development
economics
analysis
archives
government
Is the Tipping Point Toast? -- Duncan Watts
january 2008 by rybesh
The ultimate irony of Watts's research is that, if you really buy it, the most effective way to pitch your idea is ... mass marketing.
marketing
research
social
networking
advertising
brands
communication
culture
statistics
Phil Spector's Introduction to R
october 2007 by rybesh
All the things you'll ever need to do with R that you'd otherwise spend hours trying to figure out.
R
reference
statistics
howto
wikirage: What's hot now on wikipedia
september 2007 by rybesh
This site lists the pages in Wikipedia which are receiving the most edits per unique editor over various periods of time.
wiki
collaboration
statistics
editing
research
tools
//re:digg\\ » Blog Archive » *New* Sections & Data Mining
march 2007 by rybesh
"...the notion of quantifying a community’s potential bias is nothing short of remarkable."
journalism
statistics
nlp
quantitative
methods
bias
datamining
election
march 2007 by rybesh
クチコミ評判検索 β版 (Word-of-Mouth Reputation Search, beta)
february 2007 by rybesh
Japanese "blog buzz" index which analyzes blog content to track word of mouth in different domains, from business to fashion to sports.
blog
nlp
statistics
marketing
japan
february 2007 by rybesh
Manifold - Wikipedia, the free encyclopedia
november 2006 by rybesh
A manifold is an abstract mathematical space in which every point has a neighborhood which resembles Euclidean space, but in which the global structure may be more complicated.
math
machinelearning
statistics
november 2006 by rybesh
Wikistats
november 2006 by rybesh
Discusses stats.wikimedia.org, which no longer seems to be running.
wiki
research
statistics
tools
cs294project
november 2006 by rybesh
Wikistats/Measuring Article Quality
november 2006 by rybesh
This article discusses measuring Wikipedia quality in conceptual terms.
wiki
quality
statistics
november 2006 by rybesh
http://tools.wikimedia.de/~interiot/cgi-bin/Tool1/wannabe_kate
november 2006 by rybesh
Of the Wikipedia edit counting tools I've tried, this one seems to work the best.
wiki
research
statistics
tools
cs294project
november 2006 by rybesh
Wikipedia:WikiProject edit counters
november 2006 by rybesh
Overview of the many tools available for measuring the number of Wikipedia edits a particular user has made.
wiki
research
statistics
tools
cs294project
november 2006 by rybesh
WikiCharts — Top 100 — 11/2006
november 2006 by rybesh
This tool shows the articles from the English Wikipedia that are viewed most.
wiki
research
statistics
tools
cs294project
november 2006 by rybesh
Wikimedia Toolserver
november 2006 by rybesh
This server fosters the development and continuing operation of software tools for the analysis and improvement of the free content of the Wikimedia projects.
wiki
research
statistics
tools
cs294project
november 2006 by rybesh
Category:Research - Meta
november 2006 by rybesh
More Wikipedia statistics. These statistics are less general and more focused on quantitative sociological research.
wiki
research
statistics
cs294project
november 2006 by rybesh
Category:Wikipedia statistics
november 2006 by rybesh
Index of pages listing various statistics that have been collected about Wikipedia use.
wiki
research
statistics
cs294project
november 2006 by rybesh
Statistical Data Mining Tutorials
september 2006 by rybesh
A set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms.
machinelearning
reference
statistics
howto
september 2006 by rybesh
Topic Modeling Toolbox
july 2006 by rybesh
Tools for entity recognition, extraction and linking.
nlp
tools
research
statistics
datamining
analysis
matlab
july 2006 by rybesh
ahhhhhh visualization
july 2006 by rybesh
A dot plot visualization that conveys the number of results obtained from Google search queries for words of the form a{n}h{m}.
search
statistics
language
infoviz
technology
july 2006 by rybesh
Quantitative Research Methods for Information Systems and Management
july 2006 by rybesh
Quantitative methods for data collection and analysis. Research design. Conceptualization, operationalization, measurement.
courses
fall2006
berkeley
SoI
quantitative
methods
statistics
current
july 2006 by rybesh
Where are we? Rise of the Videonet
june 2006 by rybesh
Online video stats and the emergence of genres.
web
video
statistics
genre
june 2006 by rybesh
Million Dollar Blocks
june 2006 by rybesh
New York City and Wichita, KS, are among the many cities in the United States in which the state regularly spends more than one million dollars to incarcerate prisoners who live within a single census block.
infoviz
maps
statistics
government
prison
economics
june 2006 by rybesh
R/SPlus - Python Interface
april 2006 by rybesh
This allows Python programmers unfamiliar with the syntax of R to easily use its functionality and vice versa.
python
R
statistics
april 2006 by rybesh
UC DATA
april 2006 by rybesh
UC DATA is UC Berkeley's principal archive of computerized social science and health statistics information and is a part of the University's Survey Research Center.
berkeley
statistics
quantitative
research
reference
search
april 2006 by rybesh
O'Reilly Network: Analyzing Baseball Stats with R
april 2006 by rybesh
R tutorial using baseball statistics.
howto
R
statistics
april 2006 by rybesh
MediaPost Publications - Points North: Consumers Crave Web-Based TV - 02/15/2006
february 2006 by rybesh
While 25 percent of Internet users are interested in watching downloaded TV shows and movies on their PCs, 38 percent expressed interest in watching that same video on their TVs.
tv
web
video
consumer
statistics
timetags
february 2006 by rybesh
Research Methods Knowledge Base
february 2006 by rybesh
A comprehensive web-based textbook that addresses all of the topics in a typical introductory undergraduate or graduate course in social research methods.
reference
social
research
methods
statistics
february 2006 by rybesh
S Routines for Social Network Analysis in the R Environment
december 2005 by rybesh
This is a fully documented collection of R routines for social network analysis; utilities included range from hierarchical Bayesian modeling of informant accuracy to logistic network regression.
social
networking
analysis
tools
R
statistics
december 2005 by rybesh
Statistical Techniques for Audio and Video Processing
december 2005 by rybesh
The topics include audio and video object recognition, speech recognition, restoration of corrupted video and audio data, and object discovery in audio and video streams.
courses
statistics
audio
video
contentanalysis
december 2005 by rybesh
Statnet
december 2005 by rybesh
Statnet is a software package for social network analysis based on recent advances in the statistical modeling of random graphs. Runs in R.
statistics
social
networking
analysis
tools
december 2005 by rybesh
Octave
december 2005 by rybesh
GNU Octave is a high-level language, primarily intended for numerical computations.
analysis
math
unix
osx
tools
statistics
december 2005 by rybesh
DMA|Stat Spring 2005
december 2005 by rybesh
By adopting the language of software, algorithms and databases (in short, the language of computer science), it is possible to characterize works of "new media."
statistics
newmedia
courses
december 2005 by rybesh
Parsing the State of the Union
december 2005 by rybesh
To search for your own words or phrases, or to compare the occurrence of two words in Bush’s State of the Union Addresses, please try the State of the Union Parsing Tool.
politics
political
media
analysis
language
infoviz
speech
statistics
search
december 2005 by rybesh
WWW2006 Workshop - Logging Traces of Web Activity: The Mechanics of Data Collection
december 2005 by rybesh
This one day workshop will examine the trade-offs and challenges inherent to the different logging approaches and provide workshop attendees the opportunity to discuss both previous data collection experiences and upcoming challenges.
web
conference
2006
workshop
statistics
datamining
december 2005 by rybesh
Onlife
december 2005 by rybesh
Onlife is an application for the Mac OS X that observes your every interaction and then creates a personal shoebox of all the web pages you visit, emails you read, documents you write and much more.
attention
statistics
search
tools
osx
december 2005 by rybesh
Designated Emphasis in Communication, Computation and Statistics
november 2005 by rybesh
The DE in Communication, Computation and Statistics enables specialized, multi-disciplinary training and research opportunities in various emerging areas of information technology.
berkeley
statistics
communication
cs
courses
november 2005 by rybesh
Quantitative/Statistical Research Methods in Social Sciences -- Sociology (SOCIOL) C271D
november 2005 by rybesh
Selected topics in quantitative/statistical methods of research in the social sciences and particularly in sociology.
sociology
statistics
quantitative
methods
berkeley
spring2006
courses
november 2005 by rybesh
Enthought Python
october 2005 by rybesh
A Python distribution that comes with even more useful capabilities already installed and ready for use.
python
windows
science
math
statistics
datamining
tools
opensource
code
october 2005 by rybesh
The R Project for Statistical Computing
october 2005 by rybesh
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
code
datamining
language
math
opensource
statistics
tools
october 2005 by rybesh
RPy
october 2005 by rybesh
RPy is a very simple, yet robust, Python interface to the R statistical programming language.
python
statistics
code
math
october 2005 by rybesh
Orange
october 2005 by rybesh
Orange is component-based data mining software. It includes a range of preprocessing, modelling and data exploration techniques.
machinelearning
classification
code
datamining
python
opensource
tools
nlp
statistics
october 2005 by rybesh
Data Mining in Python
october 2005 by rybesh
This is a collection of libraries useful for machine learning and data mining.
python
statistics
machinelearning
nlp
code
opensource
datamining
october 2005 by rybesh
stats.py
october 2005 by rybesh
A collection of statistical functions, ranging from descriptive statistics (mean, median, histograms, variance, skew, kurtosis, etc.) to inferential statistics (t-tests, F-tests, chi-square, etc.).
python
statistics
opensource
code
october 2005 by rybesh
SciPy Scientific Tools for Python
october 2005 by rybesh
SciPy includes modules for graphics and plotting, optimization, integration, special functions, signal and image processing, genetic algorithms, ODE solvers, and others.
python
opensource
code
science
statistics
tools
math
october 2005 by rybesh