cshalizi + principal_components   26

Phys. Rev. Lett. 108, 200601 (2012): Number of Relevant Directions in Principal Component Analysis and Wishart Random Matrices
"We compute analytically, for large N, the probability P(N+,N) that a N×N Wishart random matrix has N+ eigenvalues exceeding a threshold Nζ, including its large deviation tails. This probability plays a benchmark role when performing the principal component analysis of a large empirical data set. We find that P(N+,N)≈exp⁡[-βN2ψζ(N+/N)], where β is the Dyson index of the ensemble and ψζ(κ) is a rate function that we compute explicitly in the full range 0≤κ≤1 and for any ζ. The rate function ψζ(κ) displays a quadratic behavior modulated by a logarithmic singularity close to its minimum κ⋆(ζ). This is shown to be a consequence of a phase transition in an associated Coulomb gas problem. The variance Δ(N) of the number of relevant components is also shown to grow universally (independent of ζ) as Δ(N)∼(βπ2)-1ln⁡N for large N."
to:NB  to_read  principal_components  large_deviations  random_matrices  stochastic_processes  high-dimensional_probability  re:g_paper  phase_transitions 
7 days ago by cshalizi
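A rough way to get a feel for the quantity in the abstract above is to simulate it. The sketch below draws one real Gaussian matrix (so Dyson index beta = 1) and counts eigenvalues of W = X^T X above N * zeta. The function name `count_relevant`, the seed, and the unnormalized convention W = X^T X are assumptions made for illustration; the paper's own normalization may differ.

```python
import numpy as np

def count_relevant(n, zeta, seed=0):
    """Count eigenvalues of an n x n real Wishart matrix W = X^T X above n * zeta."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, n))  # real Gaussian entries, i.e. beta = 1 (assumed)
    # The eigenvalues of X^T X are the squared singular values of X, so they are >= 0.
    eigenvalues = np.linalg.svd(x, compute_uv=False) ** 2
    return int(np.sum(eigenvalues > n * zeta))

# Fraction of "relevant" eigenvalues for a few thresholds, same random matrix each time.
n = 200
for zeta in (0.5, 1.0, 2.0, 5.0):
    print(zeta, count_relevant(n, zeta) / n)
```

With this convention the bulk of the spectrum sits roughly in [0, 4N], so thresholds well above zeta = 4 should leave no eigenvalues at all.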
[0803.0402] A note on sensitivity of principal component subspaces and the efficient detection of influential observations in high dimensions
"In this paper we introduce an influence measure based on second order expansion of the RV and GCD measures for the comparison between unperturbed and perturbed eigenvectors of a symmetric matrix estimator. Example estimators are considered to highlight how this measure compliments recent influence analysis. Importantly, we also show how a sample based version of this measure can be used to accurately and efficiently detect influential observations in practice."
to:NB  principal_components  statistics  to_teach:undergrad-ADA 
8 weeks ago by cshalizi
[0810.0944] A Principal Component Analysis for Trees
"The active field of Functional Data Analysis (about understanding the variation in a set of curves) has been recently extended to Object Oriented Data Analysis, which considers populations of more general objects. A particularly challenging extension of this set of ideas is to populations of tree-structured objects. We develop an analog of Principal Component Analysis for trees, based on the notion of tree-lines, and propose numerically fast (linear time) algorithms to solve the resulting optimization problems. The solutions we obtain are used in the analysis of a data set of 73 individuals, where each data object is a tree of blood vessels in one person's brain."
to:NB  data_analysis  principal_components  structured_data 
12 weeks ago by cshalizi
Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
"We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic.
"We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2) per trial, where n is the dimension of the instances."
to:NB  online_learning  dimension_reduction  machine_learning  learning_theory  warmuth.manfred  principal_components  low-regret_learning 
february 2012 by cshalizi
PLoS ONE: Low Pitched Voices Are Perceived as Masculine and Attractive but Do They Predict Semen Quality in Men?
How does anyone read this paper and _not_ think that they were correlating everything they could until they got a "significant" effect?
--- I am very tempted right now to make this a problem set in ADA, but that's just asking for trouble, yes?
practices_relating_to_the_transmission_of_genetic_information  regression  statistics  bad_data_analysis  via:unfogged  have_read  principal_components  to:blog 
december 2011 by cshalizi
[1111.6201] Learning a Factor Model via Regularized PCA
"We consider the problem of learning a linear factor model with an unknown number of factors. We propose a regularized form of principal component analysis (PCA) and demonstrate through experiments with synthetic and real data the superiority of resulting estimates to those produced by pre-existing factor analysis approaches. We also establish theoretical results that elucidate the manner in which our algorithm corrects biases induced by conventional PCA. An important feature of our algorithm is its computational efficiency, which is close to that of PCA, which enjoys wide use in large part due to its efficiency."
to:NB  factor_analysis  principal_components  statistics  have_read  to_teach:undergrad-ADA  van_roy.benjamin 
december 2011 by cshalizi
Population Value Decomposition, a Framework for the Analysis of Image Populations - Journal of the American Statistical Association - 106(495):775
"Images, often stored in multidimensional arrays, are fast becoming ubiquitous in medical and public health research. Analyzing populations of images is a statistical problem that raises a host of daunting challenges. The most significant challenge is the massive size of the datasets incorporating images recorded for hundreds or thousands of subjects at multiple visits. We introduce the population value decomposition (PVD), a general method for simultaneous dimensionality reduction of large populations of massive images. We show how PVD can be seamlessly incorporated into statistical modeling, leading to a new, transparent, and rapid inferential framework. Our PVD methodology was motivated by and applied to the Sleep Heart Health Study, the largest community-based cohort study of sleep containing more than 85 billion observations on thousands of subjects at two visits. This article has supplementary material online." --- Presumably just some form of SVD for higher-dimensional arrays.
to:NB  principal_components  data_analysis  to_read  to_teach:data-mining  to_teach:undergrad-ADA 
october 2011 by cshalizi
Morris L. Eaton, Multivariate Statistics: A Vector Space Approach (Beachwood, Ohio, USA: Institute of Mathematical Statistics, 2007)
"The purpose of this book is to present a version of multivariate statistical theory in which vector space and invariance methods replace, to a large extent, more traditional multivariate methods. The book is a text. Over the past ten years, various versions have been used for graduate multivariate courses at the University of Chicago, the University of Copenhagen, and the University of Minnesota. Designed for a one year lecture course or for independent study, the book contains a full complement of problems and problem solutions."
books:noted  statistics  principal_components  regression 
february 2010 by cshalizi
Holmes: Multivariate data analysis: The French way
"This paper presents exploratory techniques for multivariate data, many of them well known to French statisticians and ecologists, but few well understood in North American culture. We present the general framework of duality diagrams which encompasses discriminant analysis, correspondence analysis and principal components, and we show how this framework can be generalized to the regression of graphs on covariates." --- Having now read this, I think I can safely say that only in the land of Bourbaki would anyone think that conventional linear data analysis made more sense if one gave up talking about probability as useless, and focused all the attention on commutative diagrams.
regression  principal_components  data_analysis  linear_algebra  have_read  abstract_algebra 
december 2009 by cshalizi
[0908.3400] Decomposing data sets into skewness modes
"We derive the nonlinear equations satisfied by the coefficients of linear combinations that maximize their skewness when their variance is constrained to take a specific value. In order to numerically solve these nonlinear equations we develop a gradient-type flow that preserves the constraint. In combination with the Karhunen-Lo\`eve decomposition this leads to a set of orthogonal modes with maximal skewness. For illustration purposes we apply these techniques to atmospheric data; in this case the maximal-skewness modes correspond to strongly localized atmospheric flows. We show how these ideas can be extended, for example to maximal-flatness modes."
dimension_reduction  data_analysis  principal_components  karhunen-loeve_decomposition  statistics 
august 2009 by cshalizi
Invariant co-ordinate selection
"A general method for exploring multivariate data by comparing different estimates of multivariate scatter is presented. The method is based on the eigenvalue–eigenvector decomposition of one scatter matrix relative to another. In particular, it is shown that the eigenvectors can be used to generate an affine invariant co-ordinate system for the multivariate data. Consequently, we view this method as a method for invariant co-ordinate selection."
statistics  data_analysis  visual_display_of_quantitative_information  principal_components 
june 2009 by cshalizi
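The "one scatter matrix relative to another" idea in the abstract above can be sketched in a few lines. This is only an illustration under assumptions: the two scatters used here (the ordinary covariance and a fourth-moment scatter weighted by squared Mahalanobis distance) are one possible pair, and the function name `ics` is hypothetical rather than anything from the paper.

```python
import numpy as np

def ics(x):
    """Invariant coordinates from the eigen-decomposition of scatter s2 relative to s1."""
    n, p = x.shape
    centered = x - x.mean(axis=0)
    s1 = centered.T @ centered / n  # first scatter: covariance (assumed choice)
    # Squared Mahalanobis distance of each row with respect to s1.
    d2 = np.einsum("ij,ij->i", centered @ np.linalg.inv(s1), centered)
    # Second scatter: fourth-moment scatter (assumed choice).
    s2 = (centered * d2[:, None]).T @ centered / (n * (p + 2))
    # Solve s2 b = lambda s1 b by whitening with the Cholesky factor of s1.
    chol_inv = np.linalg.inv(np.linalg.cholesky(s1))
    m = chol_inv @ s2 @ chol_inv.T
    eigenvalues, u = np.linalg.eigh((m + m.T) / 2)  # ascending order
    b = chol_inv.T @ u                              # columns satisfy b^T s1 b = 1
    # Return eigenvalues in descending order with the matching coordinates.
    return eigenvalues[::-1], centered @ b[:, ::-1]
```

If both scatters are affine equivariant and the eigenvalues are distinct, the returned coordinates should be unchanged, up to the sign of each column, when the data are replaced by x @ a.T + c for any invertible a.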
A Meta-Analysis of Variance Accounted for and Factor Loadings in Exploratory Factor Analysis
Shorter Peterson: Your results look like a factor analysis of pure noise. Have a nice day. (Also, a citation in support of the folk wisdom that factor analysis doesn't work any better as data reduction than simple principal components analysis.)
factor_analysis  statistics  to:NB  to_teach:data-mining  via:moritz-heene  re:g_paper  dimension_reduction  principal_components  to_teach:undergrad-ADA 
may 2009 by cshalizi
Applying Discrete PCA in Data Analysis
I heard Alek talk about this at UAI 2004... and then forgot about it completely when I taught data mining. My bad.
to_teach:data-mining  principal_components  independent_components_analysis  statistics  latent_variables  latent_semantic_analysis  to:NB  jakulin.aleks  buntine.wray 
may 2009 by cshalizi
The Screens Issue - If You Liked This, Sure to Love That - Winning the Netflix Prize - NYTimes.com
What the ******* ****, Netflix wasn't using singular value decomposition? Can that really be true? (The hope that the report massively misunderstood is the only thing saving this from an "utter_stupidity" tag.)
netflix_prize  data_mining  collaborative_filtering  to_teach:data-mining  principal_components 
november 2008 by cshalizi
Novembre and Stephens, "Interpreting principal component analyses of spatial population genetic variation" (Nature Genetics)
"We find that gradients and waves observed in ... maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events."
genetics  human_genetics  statistics  principal_components  spatial_statistics  stepping_stone_model  cavalli-sforza  via:arthegall  bad_data_analysis  to_teach:data-mining  to:NB  to_teach:undergrad-ADA 
may 2008 by cshalizi
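The sinusoidal-artifact point in the quoted abstract is easy to reproduce in a toy setting. The sketch below assumes a one-dimensional chain of sites whose covariance decays as rho^|i-j|; that covariance model, the parameter values, and the function names are illustrative assumptions, not the paper's setup. It works with the population covariance directly rather than with sampled data.

```python
import numpy as np

def spatial_pcs(n_sites=50, rho=0.9, n_components=4):
    """Leading principal components of a covariance that decays with distance."""
    idx = np.arange(n_sites)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])  # assumed spatial covariance
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return eigenvectors[:, order]

def sign_changes(v):
    """Number of times consecutive entries of v switch sign."""
    s = np.sign(v)
    return int(np.sum(s[:-1] != s[1:]))

pcs = spatial_pcs()
print([sign_changes(pcs[:, k]) for k in range(pcs.shape[1])])
```

Each successive component crosses zero one more time than the previous one, the wave-like pattern one would expect from the geometry alone, with no migration events anywhere in the model.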