sechilds + statistics 79
simulation - Bootstrap vs other simulated data methods - Statistical Analysis
15 days ago by sechilds
Can you answer this? bootstrap vs other simulated data methods #bootstrap
In mixed-effects models, many statisticians simulate or bootstrap data to construct empirical confidence regions for the fixed-effect and random-effect parameters.
Resampling (i.e., bootstrapping) seems intuitive to me because it makes few assumptions about the nature of the data.
As an alternative, some identify the multivariate distribution of a set of variables and draw at random from that distribution.
My question is: Is there a principle for deciding between these approaches? Is one of them always better?
To bootstrap in a linear mixed-effects model, you sample with replacement in a way that maintains the model structure. Your data are divided into groups, and you don't want to mix data from one group into another. If a particular group has m observations, you sample m times with replacement from those m observations. You repeat this for every other group (the value of m may differ between groups). Once you have done this, you have one bootstrap sample. You fit the model to this bootstrap sample, then repeat the resampling and model fitting many times. This gives you a collection of estimated model parameters (a histogram for each, if you will).
From any such bootstrap histogram of estimates you can construct bootstrap confidence intervals. The simplest is Efron's percentile method, which takes the 2.5th and 97.5th percentiles of the ordered bootstrap estimates as the endpoints of a 95% confidence interval. For more detail, see Efron and Tibshirani's An Introduction to the Bootstrap (1993), Chapman and Hall; my book Bootstrap Methods, 2nd ed. (2007), Wiley; or the article by Efron and Tibshirani in Statistical Science (1986).
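The within-group resampling and percentile interval described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data are made up, and `fit_grand_mean` (the mean of the group means) is a hypothetical stand-in for fitting a real mixed-effects model, which would normally be done with a dedicated library.

```python
import random
import statistics

def fit_grand_mean(groups):
    """Hypothetical stand-in for 'fit the model': the mean of the group means."""
    return statistics.fmean(statistics.fmean(obs) for obs in groups.values())

def resample_within_groups(groups, rng):
    """For each group, draw m values with replacement from its own m observations."""
    return {g: rng.choices(obs, k=len(obs)) for g, obs in groups.items()}

def percentile_interval(groups, n_boot=2000, seed=0):
    """Percentile-method 95% interval from n_boot bootstrap estimates."""
    rng = random.Random(seed)
    estimates = [
        fit_grand_mean(resample_within_groups(groups, rng)) for _ in range(n_boot)
    ]
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%; keep the first and last.
    cuts = statistics.quantiles(estimates, n=40)
    return cuts[0], cuts[-1]

# Made-up grouped data; group sizes differ, as in the description above.
groups = {
    "a": [4.1, 5.0, 4.6, 5.3],
    "b": [6.2, 5.8, 6.9, 6.4, 6.0],
    "c": [3.9, 4.4, 4.0],
}
lo, hi = percentile_interval(groups)
print(f"point estimate {fit_grand_mean(groups):.2f}, 95% interval ({lo:.2f}, {hi:.2f})")
```

Observations are never moved between groups, so each bootstrap sample keeps the original grouping and group sizes.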
Now, in the absence of data, you may want to understand how the model behaves. In that case you can simulate the data and examine the results in much the same way as I described for the bootstrap. The difference is that instead of sampling from the empirical distribution of the data, you have to specify a distribution (or distributions) to sample from.
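A minimal sketch of the simulation alternative, assuming a random-intercept model with normally distributed group effects and residuals. The distributions, parameter values, and the simple `estimate_intercept` estimator are all illustrative assumptions, not something given in the answer.

```python
import random
import statistics

def simulate_groups(rng, n_groups=8, m=6, intercept=5.0, sd_group=1.0, sd_resid=0.5):
    """Draw one dataset from an assumed model: y_ij = intercept + u_i + e_ij,
    with u_i ~ N(0, sd_group) and e_ij ~ N(0, sd_resid)."""
    data = {}
    for i in range(n_groups):
        u = rng.gauss(0.0, sd_group)  # group-level random effect
        data[i] = [intercept + u + rng.gauss(0.0, sd_resid) for _ in range(m)]
    return data

def estimate_intercept(groups):
    """Hypothetical stand-in for model fitting: the mean of the group means."""
    return statistics.fmean(statistics.fmean(obs) for obs in groups.values())

rng = random.Random(1)
estimates = [estimate_intercept(simulate_groups(rng)) for _ in range(2000)]
cuts = statistics.quantiles(estimates, n=40)  # cut points at 2.5%, ..., 97.5%
print(f"middle 95% of simulated estimates: ({cuts[0]:.2f}, {cuts[-1]:.2f})")
```

Unlike the bootstrap, every draw here comes from the specified distributions, so the results are only as good as those assumptions.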
statistics
statistics:bootstrap
from twitter_favs
Smart Immigration Policy
9 weeks ago by sechilds
Almost everyone is at least a little suspicious of out-group members, so it should come as no surprise that most citizens view foreigners who want to live and work in their country with suspicion. Immigrants tend to be blamed for many negative things. They are accused of taking away jobs that citizens believe should be reserved for them, and quite a few countries have laws saying a company can only employ a foreign national if it can show that a citizen could not be found to fill the position. The presence of foreign labor is frequently protested when labor markets are tight, and in some countries this has reached the point of violence and mass displacement of immigrants. Immigrants are disproportionately blamed for crime. Some think immigrants come to live off welfare systems at the expense of citizens. Still others see immigrants as a threat to national culture, e.g. Muslims are frequently seen as a threat to countries with a Christian or secular history, and no doubt vice versa. Then there is outright racism or ethnocentrism.
To be fair, suspicion and hostility aren’t the only reactions to immigrants. Some do view them positively. However, on balance, feelings are negative, so much so that between half and two thirds of the adult population of the USA wish to see immigration reduced from current levels, and only 5-10% wish to see it increased.
Before discussing empirical findings on the various purported pros and cons of immigration, and various intellectual arguments around it, I want to explore what intelligent opinion has to say about the issue.
The General Social Survey has a number of questions around the possible effects of immigrants. For each alternative on each question I tabled the percentage of public support and a Smart Vote score. The Smart Vote score is a measure of how much and in which direction IQ variations are related to an opinion. It is calculated as follows.
immigration
politics
statistics
from instapaper
Incorrect
10 weeks ago by sechilds
Essentially, all models are wrong, but some are useful.
— George E. P. Box, Empirical Model-Building and Response Surfaces, 1987
A favorite quote among statisticians.
quotes
statistics
The link between error bars and statistical significance
10 weeks ago by sechilds
Error bars that represent the 95% confidence interval (CI) of a mean are wider than SE error bars -- about twice as wide with large sample sizes and even wider with small sample sizes. If 95% CI error bars do not overlap, you can be sure the difference is statistically significant (P < 0.05). However, the converse is not true -- you may or may not have statistical significance when the 95% confidence intervals overlap.
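A small numerical sketch of the point about overlap, assuming large samples so that the normal approximation (z of about 1.96) applies. The means and standard errors are made up for illustration.

```python
from statistics import NormalDist

def ci95(mean, se):
    """Large-sample 95% confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.975)  # about 1.96
    return mean - z * se, mean + z * se

def overlaps(ci_a, ci_b):
    """True if the two intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

def two_sided_p(mean_a, se_a, mean_b, se_b):
    """Large-sample z-test for the difference of two independent means."""
    z = (mean_b - mean_a) / (se_a**2 + se_b**2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

group_a = (10.0, 1.0)  # (mean, standard error), made-up values
results = {}
for mean_b in (12.0, 13.0, 14.5):
    group_b = (mean_b, 1.0)
    results[mean_b] = (
        overlaps(ci95(*group_a), ci95(*group_b)),
        two_sided_p(*group_a, *group_b),
    )
    print(f"mean_b={mean_b}: CIs overlap={results[mean_b][0]}, p={results[mean_b][1]:.3f}")
```

With these numbers, the intervals overlap for both 12.0 and 13.0, yet only the second difference has P < 0.05. For 14.5 the intervals do not overlap and the difference is significant.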
statistics
Nonparametric Econometrics: A Primer
10 weeks ago by sechilds
This review is a primer for those who wish to familiarize themselves with nonparametric econometrics. Though the underlying theory for many of these methods can be daunting for some practitioners, this article will demonstrate how a range of nonparametric methods can in fact be deployed in a fairly straightforward manner. Rather than aiming for encyclopedic coverage of the field, we shall restrict attention to a set of touchstone topics while making liberal use of examples for illustrative purposes. We will emphasize settings in which the user may wish to model a dataset comprised of continuous, discrete, or categorical data (nominal or ordinal), or any combination thereof. We shall also consider recent developments in which some of the variables involved may in fact be irrelevant, which alters the behavior of the estimators and optimal bandwidths in a manner that deviates substantially from conventional approaches.
econometrics
statistics
statistics:nonparametric
liebke/incanter - GitHub
february 2012 by sechilds
Incanter is a Clojure-based, R-like statistical computing and graphics environment for the JVM. At the core of Incanter are the Parallel Colt numerics library, a multithreaded version of Colt, and the JFreeChart charting library, as well as several other Java and Clojure libraries.
The motivation for creating Incanter is to provide a JVM-based statistical computing and graphics platform with R-like semantics and interactive-programming environment. Running on the JVM provides access to the large number of existing Java libraries for data access, data processing, and presentation. Clojure’s seamless integration with Java makes leveraging these libraries much simpler than is possible in R, and Incanter’s R-like semantics makes statistical programming much simpler than is possible in pure Java.
Motivation for a Lisp-based R-like statistical environment can be found in the paper Back to the Future: Lisp as a Base for a Statistical Computing System by Ihaka and Lang (2008). Incanter is also inspired by the now dormant Lisp-Stat (see the special volume in the Journal of Statistical Software on Lisp-Stat: Past, Present, and Future from 2005).
Motivation for a JVM-based Lisp can be found at the Clojure website, and screencasts of several excellent Clojure talks by the language’s creator, Rich Hickey, can be found at clojure.blip.tv.
statistics
Decision tree learning - Wikipedia, the free encyclopedia
february 2012 by sechilds
Decision tree learning, used in statistics, data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.
In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather the resulting classification tree can be an input for decision making. This page deals with decision trees in data mining.
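To make the structure concrete, here is a hand-written classification tree (not a learned one) in which internal nodes test a feature and leaves hold class labels. The features, values, and labels are invented for illustration. Each root-to-leaf path corresponds to a conjunction of feature tests.

```python
# Hypothetical tree: a dict is an internal node, a string is a leaf (class label).
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "humidity",
            "branches": {"high": "stay in", "normal": "play"},
        },
        "overcast": "play",
        "rainy": {
            "feature": "windy",
            "branches": {True: "stay in", False: "play"},
        },
    },
}

def classify(node, item):
    """Follow the branch matching the item's feature value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][item[node["feature"]]]
    return node

# The path outlook=sunny AND humidity=high leads to the leaf "stay in".
print(classify(tree, {"outlook": "sunny", "humidity": "high", "windy": False}))
```

A learning algorithm would choose the features and split points from data. This sketch only shows how a finished tree maps an observation to a label.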
statistics
data:visualization
Should Machine Learning SE be merged with CrossValidated? - Statistical Analysis Meta - Stack Exchange
february 2012 by sechilds
Should Machine Learning SE be merged with CrossValidated?
statistics
Stack_Exchange
CrossValidated
machine_learning
from twitter_favs
How To Build a Naive Bayes Classifier
february 2012 by sechilds
In general you can do a lot better with more specialized techniques. However, the Naive Bayes classifier is general-purpose, simple to implement, and good enough for most applications. And while other algorithms give better accuracy, I have found that better data combined with an algorithm you can tweak gives better results for less effort.
In this article I'm describing the math behind it. Don't fear the math, as it is simple enough for a high-schooler to understand. And even though there are a lot of libraries out there that already do this, you're far better off understanding the concept behind it; otherwise you won't be able to tweak the implementation in response to your needs.
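As a rough illustration of how little code the idea needs, here is a minimal word-count Naive Bayes sketch in Python. The linked article uses Ruby; the class name, the add-one smoothing, and the toy training data are assumptions made for this example, not the article's implementation.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Word-count Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def train(self, words, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log P(label) + sum of log P(word | label), treating words as independent
            score = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for w in words:
                smoothed = (self.word_counts[label][w] + 1) / (n_words + len(self.vocab))
                score += math.log(smoothed)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("buy cheap pills now".split(), "spam")
nb.train("cheap pills online".split(), "spam")
nb.train("meeting at noon".split(), "ham")
nb.train("lunch at noon tomorrow".split(), "ham")
print(nb.classify("cheap pills".split()))
print(nb.classify("noon meeting".split()))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.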
bayes
probability
ruby
statistics
Effect size - Wikipedia, the free encyclopedia
january 2012 by sechilds
Cohen's d is defined as the difference between two means divided by a standard deviation for the data.
Cohen's d is frequently used in estimating sample sizes. A lower Cohen's d indicates a necessity of larger sample sizes, and vice versa, as can subsequently be determined together with the additional parameters of desired significance level and statistical power.[9]
>>> The reason it's not taught in econometrics is that it's BORING!!!
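A short sketch of both ideas, under two assumptions that the excerpt does not spell out: the pooled standard deviation is used as the "standard deviation for the data", and the sample size comes from the common normal-approximation formula for comparing two means. The function names and data are illustrative.

```python
import math
import statistics
from statistics import NormalDist

def cohens_d(x, y):
    """Difference in means divided by the pooled standard deviation (one common choice)."""
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / (nx + ny - 2)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(pooled_var)

def n_per_group(d, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample comparison,
    using the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(cohens_d([2, 4, 6], [1, 3, 5]))  # means 4 and 3, pooled SD 2
for d in (0.2, 0.5, 0.8):
    print(f"d={d}: about {n_per_group(d)} per group")
```

The loop shows the relationship stated above: the smaller the effect size, the larger the sample needed to reach the same significance level and power.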
statistics
Propensity score matching - Wikipedia, the free encyclopedia
january 2012 by sechilds
In the statistical analysis of observational data, propensity score matching (PSM) is a methodology attempting to provide unbiased estimation of treatment-effects. The possibility of "bias" arises here because the effectiveness of a treatment may depend on characteristics that are associated with whether or not a participant in an observational study chooses, or is chosen, to receive a given treatment.
A treatment-effect is just jargon for the effect of something that is being studied -- like the consequences of smoking or the consequences of going to university. The people 'treated' are simply those -- the smokers, or the university graduates -- who undergo whatever it is that is being studied by the researcher. The language of 'treatment effects' comes originally from the medical literature where medical researchers have always hoped to isolate the true causal effects of different ways of dealing with disease. One way to do that is to run experiments.
In randomized experiments, the randomization enables unbiased estimation of treatment effects; for each covariate, randomization implies that treatment-groups will be balanced on average, by the law of large numbers. Unfortunately, for observational studies, the assignment of treatments to research subjects has been haphazard and not randomized; lacking randomization, observational studies frequently provide biased estimation of treatment effects and have imbalance on covariates.
In observational studies, the "treatment"-groups (or "exposure" groups) often exhibit imbalance on covariates. This covariate imbalance is confounded with treatments: It is difficult to attribute differences in responses to the "treatment" or "exposure" because the covariates are also believed to influence the response. The propensity score matching attempts to reduce the confounding effects of covariates, and so allow differences of responses to be attributed to differences of treatments (exposures).
Researchers try to decide how the world works, and, in particular, what causes what. To do this properly, it is not enough to observe correlations. It is necessary to try to understand causality. In other words, even when it is seen that smoking and cancer tend to occur together in a sample of human beings, the basic questions remain:
is the smoking itself having a causal effect, or could it just be that cancer is caused by, say, poor diet and those who smoke tend not to eat healthy foods, or
is there some gene that leads to cancer and that also, by coincidence, increases a person's enjoyment of cigarettes?
Similarly, people with university degrees tend later in their lives to earn more money than others without degrees, but is that because the education is actually causing the higher earnings?
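The confounding problem described above can be illustrated with a small simulation. Everything here is an assumption made for the sketch: a single three-level covariate that drives both treatment and outcome, a true treatment effect of 2.0, and a propensity score estimated as the observed share treated at each covariate level. Real applications usually estimate the score with a model such as logistic regression over many covariates.

```python
import random
from collections import defaultdict
from statistics import fmean

rng = random.Random(42)
TRUE_EFFECT = 2.0
P_TREAT = {0: 0.2, 1: 0.5, 2: 0.8}  # assumed: higher x makes treatment more likely

# Simulate observational data in which x affects both treatment and outcome.
units = []
for _ in range(6000):
    x = rng.choice([0, 1, 2])
    t = rng.random() < P_TREAT[x]
    y = TRUE_EFFECT * t + 3.0 * x + rng.gauss(0.0, 1.0)
    units.append((x, t, y))

treated = [u for u in units if u[1]]
controls = [u for u in units if not u[1]]

# Naive comparison: confounded, because treated units tend to have higher x.
naive = fmean(y for _, _, y in treated) - fmean(y for _, _, y in controls)

# Estimated propensity score: share of units treated at each covariate level.
n_by_x, treated_by_x = defaultdict(int), defaultdict(int)
for x, t, _ in units:
    n_by_x[x] += 1
    treated_by_x[x] += t
score = {x: treated_by_x[x] / n_by_x[x] for x in n_by_x}

# Index control outcomes by their estimated score.
controls_by_score = defaultdict(list)
for x, _, y in controls:
    controls_by_score[score[x]].append(y)

def matched_control_outcome(s):
    """Outcome of a randomly chosen control with the nearest propensity score."""
    nearest = min(controls_by_score, key=lambda c: abs(c - s))
    return rng.choice(controls_by_score[nearest])

# Matching with replacement: compare each treated unit with its matched control.
matched = fmean(y - matched_control_outcome(score[x]) for x, _, y in treated)
print(f"true effect {TRUE_EFFECT}, naive {naive:.2f}, matched {matched:.2f}")
```

In this setup the naive difference overstates the effect because the covariate is imbalanced between groups, while the matched comparison comes out close to the true value.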
statistics
probability - How to interpret coefficients from a logistic regression? - Statistical Analysis - Stack Exchange
january 2012 by sechilds
How to interpret the coefficient of a probability function [Logit] #probability
probability
statistics
from twitter_favs
Descriptive statistics, causal inference, and story time « Statistical Modeling, Causal Inference, and Social Science
december 2011 by sechilds
But story time can’t be avoided. On one hand, there are real questions to be answered and real decisions to be made in development economics (and elsewhere), and researchers and policymakers can’t simply sit still and say they can’t do anything because the data aren’t fully persuasive. (Remember the first principle of decision analysis: Not making a decision is itself a decision.)
From the other direction, once you have an interesting quantitative finding, of course you want to understand it, and it makes sense to use all your storytelling skills here. The challenge is to go back and forth between the storytelling and the data. You find some interesting result (perhaps an observational data summary, perhaps an analysis of an experiment or natural experiment), this motivates a story, which in turn suggests some new hypotheses to be studied. Yu-Sung and I were just talking about this today in regard to our article on public opinion about school vouchers.
The question is: How do quantitative analysis and story time fit into the big picture? Mike McGovern writes that he wishes Paul Collier had been more modest in his causal claims, presenting his quantitative findings as “intriguing and counterintuitive correlations” and frankly recognizing that exploration of these correlations requires real-world understanding, not just the rhetoric of hard-headed empiricism.
I agree completely with McGovern–and I endeavor to follow this sort of modesty in presenting the implications of my own applied work–and I think it’s a starting point for Collier and others. Once they recognize that, indeed, they are in story time, they can think harder about the empirical implications of their stories.
econometrics
statistics
from instapaper
Freakonomics: What Went Wrong? » American Scientist
december 2011 by sechilds
In our analysis of the Freakonomics approach, we encountered a range of avoidable mistakes, from back-of-the-envelope analyses gone wrong to unexamined assumptions to an uncritical reliance on the work of Levitt’s friends and colleagues. This turns accessibility on its head: Readers must work to discern which conclusions are fully quantitative, which are somewhat data driven and which are purely speculative.
statistics
from instapaper
density function - How to formally test for a "break" in a normal (or other) distribution - Statistical Analysis - Stack Exchange
december 2011 by sechilds
How to formally test for a "break" in a normal (or other) distribution #normaldistribution
statistics
from twitter_favs
Statisticians don’t program?
october 2011 by sechilds
We're statisticians. We don't program.
— Anonymous statistician
I was talking to a small group of statisticians a few months ago, and someone said that to me when I told them how I go about mucking around with data. It still annoys me just thinking about it. It wasn't that he didn't know how to program — because that's perfectly understandable — but he said it in a way as if programming and statistics were so separate that there was no possible way the two could go together.
Wrong.
Let's set things straight before this silly idea spreads further. Programming and statistics belong together, and you don't have to be a coding genius for it to work.
statistics
from google
General Social Survey
october 2011 by sechilds
The GSS contains a standard 'core' of demographic, behavioral, and attitudinal questions, plus topics of special interest. Many of the core questions have remained unchanged since 1972 to facilitate time-trend studies as well as replication of earlier findings. The GSS takes the pulse of America, and is a unique and valuable resource. It has tracked the opinions of Americans over the last four decades.
statistics
data
The General Social Survey is a great resource
october 2011 by sechilds
Two great things about the General Social Survey are: (1) the data are freely available online, and (2) the same questions have been asked since 1972 so you get a nice long series.
statistics
data
from google
data transformation - Why is gender typically coded 0/1 rather than 1/2, for example? - Statistical Analysis - Stack Exchange
october 2011 by sechilds
I understand the logic of coding for data analysis. My question below is on the use of a specific code.
Is there a reason why gender is often coded as 0 for female and 1 for male?
Why is this coding considered 'standard'?
Compare this with Female = 1 and Male = 2. Is there a problem with this coding?
statistics
data
survey - Likert scale rescaling - Statistical Analysis - Stack Exchange
october 2011 by sechilds
I used a -4 to +4 Likert scale in one set of questionnaires and -3 to +3 for the same questionnaires, but gathered respondents in a different way. I want to rescale from -4 to +4 down to -3 to +3 for the second group. How do I do this and secondly is there a referenced/accepted procedure for doing this? Regards S.
Rescaling the variable as if it were continuous is easy enough. If it's centered on zero and the end points represent real extremes, you can just scale it with something like:
X_scaled = X_orig × MAX_scaled / MAX_orig
That said, as was suggested in the answer to the other question linked in the comment above, scaling Likert scales is, for substantive reasons, not a great idea. Simple scaling will result in non-integer answers, which are impossible on either scale and might distort the substantive meaning that comes from scales where numbers represent real statements (e.g., 2 means "somewhat satisfied" while 0 means "neutral").
I'm generally hesitant to treat Likert scales as continuous in the first place. Look carefully at the distribution of answers on your two different scales to get a sense of any distortion that you have. It may make more substantive sense to do something other than scaling (i.e., collapsing categories in one or more of the scales and then comparing the variables as categorical).
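A quick sketch of the linear rescaling from the answer, applied to the -4 to +4 scale in the question. The function name and defaults are illustrative. It makes the non-integer problem easy to see.

```python
def rescale(x, max_orig=4, max_scaled=3):
    """Linear rescaling of a zero-centered scale: x * max_scaled / max_orig."""
    return x * max_scaled / max_orig

# Map every point of the -4..+4 scale onto the -3..+3 range.
rescaled = {x: rescale(x) for x in range(-4, 5)}
for x, value in rescaled.items():
    note = "" if value.is_integer() else "  (not a valid response on either scale)"
    print(f"{x:+d} -> {value:+.2f}{note}")
```

Only the end points and zero land on values that exist on the -3 to +3 scale. Every other response maps to a fraction such as 0.75 or 1.5, which is the distortion the answer warns about.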
likert
statistics
More bad news: The (mis)reporting of statistical results in psychology journals
september 2011 by sechilds
Another entry in the growing literature on systematic flaws in the scientific research literature.
This time the bad tidings come from Marjan Bakker and Jelte Wicherts, who write:
Around 18% of statistical results in the psychological literature are incorrectly reported. Inconsistencies were more common in low-impact journals than in high-impact journals. Moreover, around 15% of the articles contained at least one statistical conclusion that proved, upon recalculation, to be incorrect; that is, recalculation rendered the previously significant result insignificant, or vice versa. These errors were often in line with researchers’ expectations.
Their research also had a qualitative component:
To obtain a better understanding of the origins of the errors made in the reporting of statistics, we contacted the authors of the articles with errors in the second study and asked them to send us the raw data. Regrettably, only 24% of the authors shared their data, despite our request being quite specific and our assurances that the authors would remain anonymous. . . .
The paper by Bakker and Wicherts features a truly ugly graph (Figure 2) and also breaks a rule by reporting percentages to inappropriate precision (no, you don’t have to categorize 33/113 as “29.2%”), but I’ll forgive them because I like this sort of work. It’s important and represents a lot of effort. Personally, I think Jelte Wicherts, E. J. Wagenmakers, and John Ioannidis are much more deserving of the ASA Founders Award than is, say, I dunno, Ed Wegman?
statistics
from google
september 2011 by sechilds
LMU - Prof. Friedrich Leisch
september 2011 by sechilds
Sweave is a tool that allows you to embed the R code for complete data analyses in LaTeX documents. The purpose is to create dynamic reports, which can be updated automatically if data or analysis change. Instead of inserting a prefabricated graph or table into the report, the master document contains the R code necessary to obtain it. When run through R, all data analysis output (tables, graphs, etc.) is created on the fly and inserted into a final LaTeX document. The report can be automatically updated if data or analysis change, which allows for truly reproducible research.
R
statistics
software
LaTeX
september 2011 by sechilds
distributions - How to find a fitted statistical model to a series of data? - Statistical Analysis - Stack Exchange
september 2011 by sechilds
How to find a fitted statistical model to a series of data? #distributions
statistics
from twitter_favs
september 2011 by sechilds
Pollsters advise voters to be wary of polls ahead of possible spring vote - Winnipeg Free Press
february 2011 by sechilds
A good read this morning. Pollsters candidly reveal the flaws in their work to CP's @jmbryden:
polling
politics
statistics
from twitter_favs
february 2011 by sechilds
International Education Statistics: Guide to creating maps with Stata
december 2010 by sechilds
The graphs and maps on this site are created with the Stata statistical package. This article describes how to make maps like those showing Millennium Development Goal regions and UNICEF regions in Stata from a shapefile.
Shapefiles store geographic features and related information and were developed by ESRI for its ArcGIS line of software. The shapefile format is used by many other programs and maps in this format can be downloaded from various sites on the Internet. Another common map format is the MapInfo Interchange Format for use with the MapInfo software. Shapefile data is usually stored in a set of three files (.shp, .shx, .dbf), while MapInfo data is stored in two files (.mif, .mid). Some sources for shapefiles and other data are listed on the website of the U.S. Centers for Disease Control and Prevention (CDC) under "Resources for Creating Public Health Maps." The CDC itself provides shapefiles for all countries with administrative boundaries down to the state level. Please note that these shapefiles are not in the public domain and are intended for use with the CDC's Epi Info software only. Other sources of shapefiles can be found with a Google search.
This guide is divided into two parts. Read part 1 if you have Stata 9 or 10 and part 2 if you have Stata 8. The creation of maps is not supported in older versions of Stata.
maps
statistics
Stata
december 2010 by sechilds
Getting arm and lme4 running on the Mac
october 2010 by sechilds
Our "arm" package in R requires Doug Bates's "lme4" which fits multilevel models.
lme4 is currently having some problems on the Mac. But installation on the Mac can be done; it just takes a bit of work.
I have two sets of instructions below.
R
statistics
october 2010 by sechilds
r4stats.com: R info for SAS, SPSS, and Stata Users
october 2010 by sechilds
R is a free, open source, package for doing data analysis and graphics. It is roughly the equivalent to the "big three" commercial packages: SAS, SPSS and Stata. This web site provides information to help people learn R, especially people who already know another data analysis package.
To the left, you can download the free early version of the book, R for SAS and SPSS Users. Reviews of the book are there too, along with an interview with its author. There you will also find the book's support files, containing over 30 programs written in all three languages, their practice data sets and a file containing corrections and clarifications to the book.
R has thousands of free functions available. So many in fact, that it can be hard deciding which to use. The Comparison Table will help you choose R packages by comparing them to similar SAS and SPSS add-on modules. R is changing very quickly, so check back now and then to check for updates.
Also on the left are the PowerPoint slides from the SPSS Directions conference, November 2008, showing how to run R from within SPSS, and how to add R programs to SPSS menus.
The workshop notes for a two-day workshop based upon the book are there to download. A 1-day version is in the works for the University of Tennessee in May, and for the Joint Statistical Meetings in August.
The newest book, R for Stata Users, finally rolled off the printing press on May 5, 2010. A description of that book and brief bios of Robert Muenchen and Joseph Hilbe are there as well, along with the programs and data sets.
R
Stata
statistics
october 2010 by sechilds
Turning pages into data - Statistical Modeling, Causal Inference, and Social Science
august 2010 by sechilds
There is a lot of data on the web, meant to be looked at by people, but how do you turn it into a spreadsheet people could actually analyze statistically?
The technique to turn web pages intended for people into structured data sets intended for computers is called "screen scraping." It has just been made easier with a wiki/community http://scraperwiki.com/.
They provide libraries to extract information from PDF, Excel files, to automatically fill in forms and similar. Moreover, the community aspect of it should allow researchers doing similar things to get connected. It's very good. Here's an example of scraping road accident data or port of London ship arrivals.
You can already find collections of structured data online, examples are Infochimps ("find the world's data"), and Freebase ("An entity graph of people, places and things, built by a community that loves open data."). There's also a repository system for data, TheData ("An open-source application for publishing, citing and
data
statistics
august 2010 by sechilds
Cited reference search time-series « Code and Culture
july 2010 by sechilds
I was recently talking to somebody who suspected an article he wrote 30 years ago was something of a “sleeper hit” and wanted to see an actual time-series. I wrote this little script to read Google Scholar and extract the dates. You have to tell it the Google Scholar serial number for the focal cite and how many pages to collect.
For instance, if you search GS for Strang and Soule’s ARS and click where it says “Cited by 493” you get the URL “http://scholar.google.com/scholar?cites=3071200965662451019&hl=en&as_sdt=2000”. The important part of the URL is the number between “cites=” and “&”. To figure out how many pages to collect, divide the number of citations by 10 and round down. So the syntax to scrape for this cite would be:
bash gscholarscrape.sh 3071200965662451019 49
Here’s the time-series for citations to that article
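The two arguments described above (the serial number after "cites=" and the page count) can be worked out mechanically. The following is only a sketch in Python, not part of the original script; the helper name `scrape_args` is hypothetical, and it assumes the "Cited by" URL has the shape quoted above.

```python
from urllib.parse import urlparse, parse_qs


def scrape_args(cited_by_url: str, n_citations: int) -> tuple[str, int]:
    """Return (serial number, pages to collect) for a 'Cited by' URL.

    Hypothetical helper: it assumes the URL carries a 'cites' query
    parameter, as in the example quoted in the description.
    """
    query = parse_qs(urlparse(cited_by_url).query)
    serial = query["cites"][0]   # the number between "cites=" and "&"
    pages = n_citations // 10    # divide by 10 and round down
    return serial, pages


url = "http://scholar.google.com/scholar?cites=3071200965662451019&hl=en&as_sdt=2000"
serial, pages = scrape_args(url, 493)
print(f"bash gscholarscrape.sh {serial} {pages}")
```

For the 493 citations in the example this reproduces the arguments shown in the command above.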
Google
statistics
july 2010 by sechilds
Wanted: Probability distributions for rank orderings - Statistical Modeling, Causal Inference, and Social Science
july 2010 by sechilds
Dietrich Stoyan writes:
I asked the IMS people for an expert in statistics of voting/elections and they wrote me your name. I am a statistician, but never worked in the field voting/elections. It was my son-in-law who asked me for statistical theories in that field.
He posed in particular the following problem:
The aim of the voting is to come to a ranking of c candidates. Every vote is a permutation of these c candidates. The problem is to have probability distributions in the set of all permutations of c elements.
Are there theories for such distributions?
I should be very grateful for a fast answer with hints to literature. (I confess that I do not know your books.)
My reply: Rather than trying to model the ranks directly, I'd recommend modeling a latent continuous outcome which then implies a distribution on ranks, if the ranks are of interest. There are lots of distributions of c-dimensional continuous outcomes. In political science, the usual way to start is to model the pos
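The idea in the reply, a latent continuous outcome that implies a distribution on ranks, can be illustrated with a small simulation. This is a sketch under stated assumptions, not the model from the post: the candidate means, the independent Gaussian noise, and the function name `sample_ranking` are all chosen here for illustration.

```python
import random
from collections import Counter


def sample_ranking(means, sd=1.0, rng=random):
    """Draw one vote: a permutation of candidate indices, best first.

    Assumption for illustration: each candidate's latent score is its
    mean plus independent Gaussian noise; the ranking is the sort order.
    """
    scores = [rng.gauss(mu, sd) for mu in means]
    return tuple(sorted(range(len(means)), key=lambda i: -scores[i]))


rng = random.Random(0)        # fixed seed so the run is repeatable
means = [1.0, 0.5, 0.0]       # hypothetical latent means for c = 3 candidates
votes = Counter(sample_ranking(means, rng=rng) for _ in range(10_000))

# The tallies approximate the implied distribution over all c! permutations.
for ranking, count in votes.most_common():
    print(ranking, count / 10_000)
```

Changing the means or the noise scale changes the implied distribution over permutations without having to specify a probability for each ranking directly.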
statistics
july 2010 by sechilds
wos2tab.pl « Code and Culture
july 2010 by sechilds
One of my grad students is doing some citation network analysis, for which the Python script (and .exe wrapper) wos2pajek is very well-suited. (Since most network packages can read “.net” this is a good idea even if you’re not using Pajek).
However the student is also interested in node level attributes, not just the network. Unfortunately WOS queries are field-tagged which is kind of a pain to work with and the grad student horrified me by expressing the willingness to spend weeks reshaping the data by hand in Excel. (Even in grad school your time is a lot more valuable than that). To get the data into tab-delimited text, I modified an earlier script I wrote for parsing field-tagged IMDb files (in my case business.list but most of the film-level IMDb files are structured similarly). The basic approach is to read a file line-by-line and match its contents by field-tag, saving the contents in a variable named after the tag. Then when you get to the new record delimiter (in this case, a
Pajek
Perl
Python
statistics
july 2010 by sechilds
Using grep (or mdfind) to reshape data
april 2010 by sechilds
Sometimes you have cross-class data that’s arranged the opposite of how you want. For instance, suppose I have a bunch of files organized by song, and I’m interested in finding all the song files that mention a particular radio station, say KIIS-FM. I can run the following command that finds all the song files in my song directory (or its subdirectories) and puts the names of these files in a text file called “kiis.txt”
statistics
april 2010 by sechilds
Handy statistical lexicon - Statistical Modeling, Causal Inference, and Social Science
may 2009 by sechilds
These are all important methods and concepts related to statistics that are not as well known as they should be. I hope that by giving them names, we will make the ideas more accessible to people:
statistics
may 2009 by sechilds
The Regression-Discontinuity Design
september 2008 by sechilds
The regression-discontinuity design. What a terrible name! In everyday language both parts of the term have connotations that are primarily negative. To most people "regression" implies a reversion backwards or a return to some earlier, more primitive state while "discontinuity" suggests an unnatural jump or shift in what might otherwise be a smoother, more continuous process. To a research methodologist, however, the term regression-discontinuity (hereafter labeled "RD") carries no such negative meaning. Instead, the RD design is seen as a useful method for determining whether a program or treatment is effective.
statistics
work
september 2008 by sechilds
Some papers and presentations by Roger Newson
september 2008 by sechilds
A collection of papers by Roger Newson that deal extensively with Stata.
Stata
statistics
work
september 2008 by sechilds
Descriptive Statistics Excel/Stata
june 2008 by sechilds
These notes are meant to provide a general overview on how to input data in Excel and Stata and how to perform basic data analysis by looking at some descriptive statistics using both programs.
stata
Excel
statistics
work
june 2008 by sechilds
Difficulties in estimating group-level variances and correlations
may 2008 by sechilds
A political scientist wrote in with a question that actually comes up a lot, having to do with hard-to-estimate group-level variance and correlation parameters in multilevel models. The short answer is, when these things are hard to estimate, it's often b
statistics
may 2008 by sechilds
NPWRC :: Statistical Significance Testing
may 2008 by sechilds
Four basic steps constitute statistical hypothesis testing. First, one develops a null hypothesis about some phenomenon or parameter. This null hypothesis is generally the opposite of the research hypothesis, which is what the investigator truly believes
statistics
econometrics
may 2008 by sechilds
Hierarchical Linear Model (HLM)
april 2008 by sechilds
A page of resources on hierarchical linear models.
statistics
work
econometrics
april 2008 by sechilds
The secret weapon
april 2008 by sechilds
An incredibly useful method is to fit a statistical model repeatedly on several different datasets and then display all these estimates together. For example, running a regression on data on each of 50 states (see here as discussed here), or running a reg
economics
statistics
econometrics
april 2008 by sechilds
Researcher incentives and empirical methods
april 2008 by sechilds
Economists are quick to assume opportunistic behavior in almost every walk of life other than our own. Our empirical methods are based on assumptions of human behavior that would not pass muster in any of our models.
economics
statistics
april 2008 by sechilds
Analytics According to Captain Kirk
april 2008 by sechilds
In my seminars, I enjoy teaching analytics because the fun is in finding effective and memorable methods to help people understand the concepts. One of my favorites is an analysis of the Red-Shirt Phenomenon in Star Trek.
startrek
statistics
april 2008 by sechilds
p-values blah blah blah
april 2008 by sechilds
I agree that p-values are a problem, but not quite for the same reasons as Hubbard and Lindsay do.
statistics
april 2008 by sechilds
Enzmann_Software
february 2008 by sechilds
Has an SPSS-to-Stata converter.
statistics
software
tools
stata
spss
february 2008 by sechilds
http://www.crimsoneditor.com/english/board/CrazyWWWBoard.cgi?db=file9&mode=read&num=1290&page=12&ftype=6&fval=&backdepth=1
february 2008 by sechilds
How to set up Crimson Editor to run Stata do-files automatically.
Stata
editor
text
statistics
work
app:text_editor
february 2008 by sechilds
Some notes on text editors for Stata users
february 2008 by sechilds
A good text editor is always very useful. Syntax highlighting is great.
stata
statistics
editor
text
programming
work
app:text_editor
february 2008 by sechilds
The Bayesian Heresy: Religion in America
february 2008 by sechilds
An interesting graph and article that indicates that Americans change religions during their lives.
religion
US
statistics
february 2008 by sechilds
Worldometers - real time world statistics
september 2007 by sechilds
This site looks really cool, especially how all the numbers are constantly changing. I'm not exactly sure what the use of it is, but it is something to look at from time to time.
imported
statistics
september 2007 by sechilds