cshalizi + data_analysis   42

Greetings, Philosophers - Kieran Healy
But what _kind_ of bootstrap? It's clustered data (raters x schools), which raises interesting technical issues!
philosophy  academia  data_analysis  healy.kieran  bootstrap  to_teach:undergrad-ADA 
9 weeks ago by cshalizi
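One possible answer to the "what kind" question, as a minimal sketch: resample whole schools with replacement, so that all the ratings of a sampled school travel together. The data layout (a dict from school to its list of ratings), the name `cluster_bootstrap_means`, and the toy numbers are assumptions made for illustration, not Healy's data or method. A fuller treatment of crossed raters x schools data would presumably resample raters as well.

```python
import random
from statistics import mean

def cluster_bootstrap_means(ratings_by_school, n_boot=1000, seed=0):
    """Bootstrap the grand mean rating by resampling whole schools."""
    rng = random.Random(seed)
    schools = list(ratings_by_school)
    boot_means = []
    for _ in range(n_boot):
        # Draw schools with replacement; each draw carries all of its ratings.
        sample = [rng.choice(schools) for _ in schools]
        pooled = [r for s in sample for r in ratings_by_school[s]]
        boot_means.append(mean(pooled))
    return boot_means

# Hypothetical toy data: three schools, a few rater scores each.
toy = {"A": [4.5, 4.0, 4.8], "B": [3.0, 3.5], "C": [2.0, 2.5, 2.2, 1.9]}
boots = sorted(cluster_bootstrap_means(toy, n_boot=2000))
lo, hi = boots[int(0.025 * len(boots))], boots[int(0.975 * len(boots))]
print(f"95% percentile interval for the grand mean: ({lo:.2f}, {hi:.2f})")
```

Resampling individual ratings instead would treat scores from the same school as independent, which is exactly what the clustering calls into question.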
[0810.0944] A Principal Component Analysis for Trees
"The active field of Functional Data Analysis (about understanding the variation in a set of curves) has been recently extended to Object Oriented Data Analysis, which considers populations of more general objects. A particularly challenging extension of this set of ideas is to populations of tree-structured objects. We develop an analog of Principal Component Analysis for trees, based on the notion of tree-lines, and propose numerically fast (linear time) algorithms to solve the resulting optimization problems. The solutions we obtain are used in the analysis of a data set of 73 individuals, where each data object is a tree of blood vessels in one person's brain."
to:NB  data_analysis  principal_components  structured_data 
12 weeks ago by cshalizi
Wolfram Alpha Pro trial « Follow the Data
Other people may not take the same malicious glee in this that I do.
wolfram_alpha  funny:pointed  ai  data_analysis  to:blog 
february 2012 by cshalizi
Is psychological research really as good as medical research? Effect size comparisons between psychology and medicine
"Researchers have looked at comparisons between medical epidemiological research and psychological research using effect size r in an effort to compare relative effects. Often the outcomes of such efforts have demonstrated comparatively low effects for medical epidemiology research in comparison with effect sizes seen in psychology. The conclusion has often been that relatively small effects seen in psychology research are as strong as those found in important epidemiological medical research. The author suggests that many of the calculated effect sizes from medical epidemiological research on which this conclusion has been based are flawed. Specifically, rather than calculating effect sizes for treatment, many results have been for a Treatment Effect × Disease Effect interaction that was irrelevant to the main study hypothesis. A technique for developing a “hypothesis-relevant” effect size r is proposed."
data_analysis  statistics  psychology  epidemiology  evisceration  via:moritz-heene  have_read 
february 2012 by cshalizi
A General Framework for Dimensionality-Reducing Data Visualization Mapping
"In recent years, a wealth of dimension-reduction techniques for data visualization and preprocessing has been established. Nonparametric methods require additional effort for out-of-sample extensions, because they provide only a mapping of a given finite set of points. In this letter, we propose a general view on nonparametric dimension reduction based on the concept of cost functions and properties of the data. Based on this general principle, we transfer nonparametric dimension reduction to explicit mappings of the data manifold such that direct out-of-sample extensions become possible. Furthermore, this concept offers the possibility of investigating the generalization ability of data visualization to new data points. We demonstrate the approach based on a simple global linear mapping, as well as prototype-based local linear mappings. In addition, we can bias the functional form according to given auxiliary information. This leads to explicit supervised visualization mappings with discriminative properties comparable to state-of-the-art approaches."
in_NB  dimension_reduction  visual_display_of_quantitative_information  data_analysis  data_mining  manifold_learning  to_teach:data-mining 
february 2012 by cshalizi
Mahoney: Randomized Algorithms for Matrices and Data
"Randomized algorithms for very large matrix problems have received a great deal of attention in recent years. Much of this work was motivated by problems in large-scale data analysis, largely since matrices are popular structures with which to model data drawn from a wide range of application domains, and this work was performed by individuals from many different research communities. While the most obvious benefit of randomization is that it can lead to faster algorithms, either in worst-case asymptotic theory and/or numerical implementation, there are numerous other benefits that are at least as important. For example, the use of randomization can lead to simpler algorithms that are easier to analyze or reason about when applied in counterintuitive settings; it can lead to algorithms with more interpretable output, which is of interest in applications where analyst time rather than just computational time is of interest; it can lead implicitly to regularization and more robust output; and randomized algorithms can often be organized to exploit modern computational architectures better than classical numerical methods.

"This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis. Throughout this review, an emphasis will be placed on a few simple core ideas that underlie not only recent theoretical advances but also the usefulness of these tools in large-scale data applications. Crucial in this context is the connection with the concept of statistical leverage. This concept has long been used in statistical regression diagnostics to identify outliers; and it has recently proved crucial in the development of improved worst-case matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists. This connection arises naturally when one explicitly decouples the effect of randomization in these matrix algorithms from the underlying linear algebraic structure. This decoupling also permits much finer control in the application of randomization, as well as the easier exploitation of domain knowledge.

"Most of the review will focus on random sampling algorithms and random projection algorithms for versions of the linear least-squares problem and the low-rank matrix approximation problem. These two problems are fundamental in theory and ubiquitous in practice. Randomized methods solve these problems by constructing and operating on a randomized sketch of the input matrix A — for random sampling methods, the sketch consists of a small number of carefully-sampled and rescaled columns/rows of A, while for random projection methods, the sketch consists of a small number of linear combinations of the columns/rows of A. Depending on the specifics of the situation, when compared with the best previously-existing deterministic algorithms, the resulting randomized algorithms have worst-case running time that is asymptotically faster; their numerical implementations are faster in terms of clock-time; or they can be implemented in parallel computing environments where existing numerical algorithms fail to run at all. Numerous examples illustrating these observations will be described in detail."
to:NB  data_analysis  linear_regression  computational_complexity 
january 2012 by cshalizi
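To make the "randomized sketch of the input matrix A" concrete, here is a minimal random-projection sketch for a two-column least-squares problem. It assumes a dense Gaussian sketching matrix, which is the simplest choice to write down and not one the abstract singles out; the function names and toy data are hypothetical. A dense Gaussian sketch like this one is not faster than solving the full problem, so it only illustrates the idea.

```python
import random

def solve_ls_2col(rows, targets):
    """Least squares for a two-column design via the 2x2 normal equations."""
    s11 = sum(r[0] * r[0] for r in rows)
    s12 = sum(r[0] * r[1] for r in rows)
    s22 = sum(r[1] * r[1] for r in rows)
    t1 = sum(r[0] * t for r, t in zip(rows, targets))
    t2 = sum(r[1] * t for r, t in zip(rows, targets))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

def sketched_ls(rows, targets, k, seed=0):
    """Solve min ||S A x - S b|| where S is a k x n Gaussian matrix (assumed)."""
    rng = random.Random(seed)
    n = len(rows)
    sk_rows, sk_targets = [], []
    for _ in range(k):
        # One row of S: each sketched row is a random linear combination of rows of A.
        g = [rng.gauss(0.0, 1.0) / k ** 0.5 for _ in range(n)]
        sk_rows.append((sum(gi * r[0] for gi, r in zip(g, rows)),
                        sum(gi * r[1] for gi, r in zip(g, rows))))
        sk_targets.append(sum(gi * t for gi, t in zip(g, targets)))
    return solve_ls_2col(sk_rows, sk_targets)

# Hypothetical toy regression: intercept 2, slope 3, Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(500)]
rows = [(1.0, x) for x in xs]
targets = [2.0 + 3.0 * x + rng.gauss(0, 0.5) for x in xs]
print("full   :", solve_ls_2col(rows, targets))
print("sketch :", sketched_ls(rows, targets, k=50))
```

The random-sampling variant described in the abstract would instead keep a few rescaled rows of A, chosen with probabilities tied to their statistical leverage.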
csvkit 0.4.2 (beta) — csvkit 0.4.2 (beta) documentation
"csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats."
data_analysis  unix  to_teach:statcomp  via:sparkcamp 
january 2012 by cshalizi
[1111.1855] Fréchet means of curves for signal averaging and application to ECG data analysis
"Signal averaging is the process that consists in computing a mean shape from a set of noisy signals. In the presence of geometric variability in time in the data, the usual Euclidean mean of the raw data yields a mean pattern that does not reflect the typical shape of the observed signals. In this setting, it is necessary to use alignment techniques for a precise synchronization of the signals, and then to average the aligned data to obtain a consistent mean shape. In this paper, we study the numerical performances of Fréchet means of curves which are extensions of the usual Euclidean mean to spaces endowed with non-Euclidean metrics. This yields a new algorithm for signal averaging without a reference template. We apply this approach to the estimation of a mean heart cycle from ECG records."
to:NB  statistics  data_analysis  visual_display_of_quantitative_information 
november 2011 by cshalizi
[1110.3917] How to Evaluate Dimensionality Reduction? - Improving the Co-ranking Matrix
"The growing number of dimensionality reduction methods available for data visualization has recently inspired the development of quality assessment measures, in order to evaluate the resulting low-dimensional representation independently from a method's inherent criteria. Several (existing) quality measures can be (re)formulated based on the so-called co-ranking matrix, which subsumes all rank errors (i.e. differences between the ranking of distances from every point to all others, comparing the low-dimensional representation to the original data). The measures are often based on the partitioning of the co-ranking matrix into 4 submatrices, divided at the K-th row and column, calculating a weighted combination of the sums of each submatrix. Hence, the evaluation process typically involves plotting a graph over several (or even all possible) settings of the parameter K. Considering simple artificial examples, we argue that this parameter controls two notions at once, that need not necessarily be combined, and that the rectangular shape of submatrices is disadvantageous for an intuitive interpretation of the parameter. We debate that quality measures, as general and flexible evaluation tools, should have parameters with a direct and intuitive interpretation as to which specific error types are tolerated or penalized. Therefore, we propose to replace K with two parameters to control these notions separately, and introduce a differently shaped weighting on the co-ranking matrix. The two new parameters can then directly be interpreted as a threshold up to which rank errors are tolerated, and a threshold up to which the rank-distances are significant for the evaluation. Moreover, we propose a color representation of local quality to visually support the evaluation process for a given mapping, where every point in the mapping is colored according to its local contribution to the overall quality."
--- Look at this carefully, and see if it could be taught in data mining (and whether it's worth doing so.)
to:NB  dimension_reduction  statistics  data_analysis  visual_display_of_quantitative_information  to_teach:data-mining 
october 2011 by cshalizi
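For teaching purposes, the co-ranking matrix itself is easy to compute. The sketch below builds it for a handful of points, with entry (k, l) counting the pairs whose neighbour rank is k in the original data and l in the low-dimensional representation. Breaking distance ties by index, the function names, and the toy points are assumptions for illustration; the paper's proposed two-parameter weighting is not implemented here.

```python
def ranks_from(points):
    """rank[i][j] = position of j when sorted by distance from i (1 = nearest)."""
    n = len(points)
    out = []
    for i in range(n):
        d = [sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
             for j in range(n)]
        # Ties in distance are broken by index (an assumed convention).
        order = sorted((j for j in range(n) if j != i), key=lambda j: (d[j], j))
        r = [0] * n
        for pos, j in enumerate(order, start=1):
            r[j] = pos
        out.append(r)
    return out

def coranking(high, low):
    """q[k-1][l-1] = number of pairs (i, j) with rank k in `high`, rank l in `low`."""
    n = len(high)
    rh, rl = ranks_from(high), ranks_from(low)
    q = [[0] * (n - 1) for _ in range(n - 1)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[rh[i][j] - 1][rl[i][j] - 1] += 1
    return q

# Hypothetical toy data: 2-D points and a crude 1-D "embedding".
high = [(0.0, 0.0), (1.0, 0.2), (2.0, 1.5), (3.0, 3.5), (0.5, 3.0)]
low = [(p[0],) for p in high]  # keep only the first coordinate
for row in coranking(high, low):
    print(row)
```

A perfect rank-preserving mapping puts all the mass on the diagonal; off-diagonal mass is what the K-based and the proposed two-parameter measures weight in different ways.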
Population Value Decomposition, a Framework for the Analysis of Image Populations - Journal of the American Statistical Association - 106(495):775
"Images, often stored in multidimensional arrays, are fast becoming ubiquitous in medical and public health research. Analyzing populations of images is a statistical problem that raises a host of daunting challenges. The most significant challenge is the massive size of the datasets incorporating images recorded for hundreds or thousands of subjects at multiple visits. We introduce the population value decomposition (PVD), a general method for simultaneous dimensionality reduction of large populations of massive images. We show how PVD can be seamlessly incorporated into statistical modeling, leading to a new, transparent, and rapid inferential framework. Our PVD methodology was motivated by and applied to the Sleep Heart Health Study, the largest community-based cohort study of sleep containing more than 85 billion observations on thousands of subjects at two visits. This article has supplementary material online." --- Presumably just some form of SVD for higher-dimensional arrays.
to:NB  principal_components  data_analysis  to_read  to_teach:data-mining  to_teach:undergrad-ADA 
october 2011 by cshalizi
Weisfeiler-Lehman Graph Kernels
"In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis."
in_NB  network_data_analysis  kernel_methods  data_analysis  graph_limits  machine_learning  re:smoothing_adjacency_matrices  to_read 
october 2011 by cshalizi
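The relabeling idea behind the subtree kernel can be sketched in a few lines. In each round a node's new label is its old label together with the sorted multiset of its neighbours' labels, and the kernel is the inner product of the resulting label counts. The graph representation (adjacency dict plus label dict) and the function names are assumptions, and this version keeps nested tuples as labels rather than compressing them, so it does not reproduce the linear runtime claimed in the abstract.

```python
from collections import Counter

def wl_features(adj, labels, h):
    """Counts of node labels seen over h rounds of Weisfeiler-Lehman refinement."""
    feats = Counter(labels.values())
    current = dict(labels)
    for _ in range(h):
        # New label = own label plus the sorted multiset of neighbour labels.
        current = {v: (current[v], tuple(sorted(current[u] for u in adj[v])))
                   for v in adj}
        feats.update(current.values())
    return feats

def wl_subtree_kernel(g1, g2, h=2):
    """Inner product of WL label-count vectors; each graph is (adj, labels)."""
    f1 = wl_features(g1[0], g1[1], h)
    f2 = wl_features(g2[0], g2[1], h)
    return sum(c * f2[lab] for lab, c in f1.items())

# Hypothetical toy graphs with a single node label "x".
path = ({0: [1], 1: [0, 2], 2: [1]}, {0: "x", 1: "x", 2: "x"})
triangle = ({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: "x", 1: "x", 2: "x"})
print(wl_subtree_kernel(path, triangle, h=1))
```

With one round, the path and the triangle share the initial label and the "two x-neighbours" label, but only the path has nodes with a single neighbour.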
Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis
The fact that these are "new statistics" for many psychologists, in this day and age, tells us much about the state of the discipline.
books:noted  psychology  data_analysis 
september 2011 by cshalizi
Against between-subjects experiments | Ready-to-hand
I wonder how hard it would be to construct a Simpson's-paradox situation, where the sign of the ATE from the between-subjects experiment was the opposite of that within each subject?
social_science_methodology  experimental_psychology  experimental_design  data_analysis  to:blog 
june 2011 by cshalizi
Principles of Applied Statistics - Academic and Professional Books - Cambridge University Press
"Applied statistics is more than data analysis, but it is easy to lose sight of the big picture. David Cox and Christl Donnelly distil decades of scientific experience into usable principles for the successful application of statistics, showing how good statistical strategy shapes every stage of an investigation. As you advance from research or policy question, to study design, through modelling and interpretation, and finally to meaningful conclusions, this book will be a valuable guide. Over a hundred illustrations from a wide variety of real applications make the conceptual points concrete, illuminating your path and deepening your understanding. This book is essential reading for anyone who makes extensive use of statistical methods in their work."
books:recommended  statistics  data_analysis  to:NB  to_teach:undergrad-ADA  coveted  cox.david_r. 
may 2011 by cshalizi
Sex Differences in Variability in General Intelligence: A New Look at the Old Question
This would make a great mixture-models problem set, if only the data were available, which doesn't seem to be the case.
mental_testing  iq  data_analysis  sex_differences  re:g_paper 
march 2011 by cshalizi
Low frequency cultural noise
Not only is that a great title (as Nick says), but it literally turns out we make the Earth move: "Abnormal cultural seismic noise is observed in the frequency range of 0.01–0.05 Hz. Cultural noise generated by human activities is generally observed in frequencies above 1 Hz, and is greater in the daytime than at night. The low-frequency noise presented in this paper exhibits a characteristic amplitude variation and can be easily identified from time domain seismograms in the frequency range of interest. The amplitude variation is predominantly in the vertical component, but the horizontal components also show variations. Low-frequency noise is markedly periodic, which reinforces its interpretation as cultural noise. Such noise is observed world-wide, but is limited to areas in the vicinity of railways. The amplitude variation in seismograms correlates strongly with railway timetables..."
geology  time_series  fourier_analysis  data_analysis  via:nick-watkins  to_teach  trains 
may 2010 by cshalizi
A dissection of John Gottman's love lab. - By Laurie Abraham - Slate Magazine
This is confused, or at least confusingly written. Is the objection to not evaluating the classifier out of sample? Or that the success of even a very stupid rule should be high (because most couples don't get divorced within five years)? (That would be a valid point, but it's not "base-rate neglect".) Or what?
marriage  classifiers  to_teach:data-mining  data_analysis 
march 2010 by cshalizi
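The "very stupid rule" point is just arithmetic, sketched here with made-up numbers: if only a small fraction of couples divorce within the window, predicting "no divorce" for everyone is already right most of the time. The counts below are hypothetical and are not taken from Gottman's studies or the Slate piece.

```python
def always_negative_accuracy(n_couples, n_divorced):
    """Accuracy of the rule 'predict no divorce for every couple'."""
    return (n_couples - n_divorced) / n_couples

# Hypothetical numbers, chosen only to illustrate the arithmetic.
baseline = always_negative_accuracy(n_couples=100, n_divorced=15)
print(f"accuracy of the trivial rule: {baseline:.2f}")
```

Any reported classifier accuracy would need to be compared against this baseline, and measured on couples not used to fit the rule.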
Prefrontal.org » PAPER: How reliable are the results from functional magnetic resonance imaging?
"we take a close look at what is currently known about the reliability of fMRI findings. First, we examine the many factors that influence the quality of acquired fMRI data. We also conduct a review of the existing literature to determine if some measure of agreement has emerged regarding the reliability of fMRI. Finally, we provide commentary on ways to improve fMRI reliability and what questions remain unanswered. Reliability is the foundation on which scientific investigation is based. How reliable are the results from fMRI?"
fmri  data_analysis  to:NB  via:mind-hacks 
march 2010 by cshalizi
Holmes: Multivariate data analysis: The French way
"This paper presents exploratory techniques for multivariate data, many of them well known to French statisticians and ecologists, but few well understood in North American culture. We present the general framework of duality diagrams which encompasses discriminant analysis, correspondence analysis and principal components, and we show how this framework can be generalized to the regression of graphs on covariates." --- Having now read this, I think I can safely say that only in the land of Bourbaki would anyone think that conventional linear data analysis made more sense if one gave up talking about probability as useless, and focused all the attention on commutative diagrams.
regression  principal_components  data_analysis  linear_algebra  have_read  abstract_algebra 
december 2009 by cshalizi
Looking at Data — Crooked Timber
"The real distinction between qualitative and quantitative is not widely appreciated. People think it has something to do with counting versus not counting, but this is a mistake. If the interpretive work necessary to make sense of things is immediately obvious to everyone, it’s qualitative data. If the interpretative work you need to do is immediately obvious only to experts, it’s quantitative data."
data  data_analysis  methodology  funny:academic  funny:because_its_true  healy.kieran  to_teach:data-mining  to_teach:undergrad-ADA 
october 2009 by cshalizi
Sweave
"Sweave is a tool that allows to embed the R code for complete data analyses in latex documents. The purpose is to create dynamic reports, which can be updated automatically if data or analysis change. Instead of inserting a prefabricated graph or table into the report, the master document contains the R code necessary to obtain it. When run through R, all data analysis output (tables, graphs, etc.) is created on the fly and inserted into a final latex document. The report can be automatically updated if data or analysis change, which allows for truly reproducible research."
sweave  R  latex  paper_writing  programming  via:jhofman  where_have_you_been_all_my_life  data_analysis 
august 2009 by cshalizi
[0908.3400] Decomposing data sets into skewness modes
"We derive the nonlinear equations satisfied by the coefficients of linear combinations that maximize their skewness when their variance is constrained to take a specific value. In order to numerically solve these nonlinear equations we develop a gradient-type flow that preserves the constraint. In combination with the Karhunen-Loève decomposition this leads to a set of orthogonal modes with maximal skewness. For illustration purposes we apply these techniques to atmospheric data; in this case the maximal-skewness modes correspond to strongly localized atmospheric flows. We show how these ideas can be extended, for example to maximal-flatness modes."
dimension_reduction  data_analysis  principal_components  karhunen-loeve_decomposition  statistics 
august 2009 by cshalizi
Estimating Effects and Correlations in Neuroimaging Data
This makes it sound like I'm presenting; but really I see my role as more that of "designated heckler".
fmri  neuroscience  data_analysis  statistics  gigs  experimental_psychology  social_neuroscience 
june 2009 by cshalizi
Invariant co-ordinate selection
"A general method for exploring multivariate data by comparing different estimates of multivariate scatter is presented. The method is based on the eigenvalue–eigenvector decomposition of one scatter matrix relative to another. In particular, it is shown that the eigenvectors can be used to generate an affine invariant co-ordinate system for the multivariate data. Consequently, we view this method as a method for invariant co-ordinate selection."
statistics  data_analysis  visual_display_of_quantitative_information  principal_components 
june 2009 by cshalizi
Data Analysis Using Regression and Multilevel/Hierarchical Models - Gelman and Hill (@Labyrinth)
Maybe the best applied textbook on regression and hierarchical modeling available. Good as an introduction to statistical modeling more generally.
regression  hierarchical_models  statistics  modeling  data_analysis  gelman.andrew  hill.jennifer  books:recommended 
january 2008 by cshalizi
