cshalizi + principal_components 26
Phys. Rev. Lett. 108, 200601 (2012): Number of Relevant Directions in Principal Component Analysis and Wishart Random Matrices
7 days ago by cshalizi
"We compute analytically, for large $N$, the probability $P(N_+,N)$ that an $N\times N$ Wishart random matrix has $N_+$ eigenvalues exceeding a threshold $N\zeta$, including its large deviation tails. This probability plays a benchmark role when performing the principal component analysis of a large empirical data set. We find that $P(N_+,N)\approx\exp[-\beta N^2\psi_\zeta(N_+/N)]$, where $\beta$ is the Dyson index of the ensemble and $\psi_\zeta(\kappa)$ is a rate function that we compute explicitly in the full range $0\le\kappa\le 1$ and for any $\zeta$. The rate function $\psi_\zeta(\kappa)$ displays a quadratic behavior modulated by a logarithmic singularity close to its minimum $\kappa^\star(\zeta)$. This is shown to be a consequence of a phase transition in an associated Coulomb gas problem. The variance $\Delta(N)$ of the number of relevant components is also shown to grow universally (independent of $\zeta$) as $\Delta(N)\sim(\beta\pi^2)^{-1}\ln N$ for large $N$."
to:NB
to_read
principal_components
large_deviations
random_matrices
stochastic_processes
high-dimensional_probability
re:g_paper
phase_transitions
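A minimal simulation sketch of the quantity in the abstract above, not the paper's analytical method: draw real (β = 1) Wishart matrices W = XᵀX, count the eigenvalues above Nζ, and look at the spread of that count. The square Gaussian X with unit-variance entries, the values of `n`, `zeta` and `trials`, and the helper names `jacobi_eigenvalues` and `count_above_threshold` are all assumptions made for illustration. The Jacobi routine is there only to keep the example dependency-free.

```python
import math
import random


def jacobi_eigenvalues(matrix, max_sweeps=100, tol=1e-18):
    """Eigenvalues of a real symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    for _sweep in range(max_sweeps):
        # Stop once the off-diagonal mass is negligible.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def count_above_threshold(n, zeta, rng):
    """Draw an n x n real Wishart matrix W = X^T X; count eigenvalues > n * zeta."""
    x = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    w = [[sum(x[k][i] * x[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    return sum(1 for lam in jacobi_eigenvalues(w) if lam > n * zeta)


# Illustrative settings (assumptions, not taken from the paper).
rng = random.Random(0)
n, zeta, trials = 12, 1.0, 100
counts = [count_above_threshold(n, zeta, rng) for _ in range(trials)]
mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / (trials - 1)
print(f"mean N+/N = {mean / n:.3f}, sample variance of N+ = {var:.3f}")
```

At sizes this small the logarithmic growth of the variance will not be visible; the sketch only makes the count N+ concrete.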
[0803.0402] A note on sensitivity of principal component subspaces and the efficient detection of influential observations in high dimensions
8 weeks ago by cshalizi
"In this paper we introduce an influence measure based on second order expansion of the RV and GCD measures for the comparison between unperturbed and perturbed eigenvectors of a symmetric matrix estimator. Example estimators are considered to highlight how this measure complements recent influence analysis. Importantly, we also show how a sample based version of this measure can be used to accurately and efficiently detect influential observations in practice."
to:NB
principal_components
statistics
to_teach:undergrad-ADA
[0810.0944] A Principal Component Analysis for Trees
12 weeks ago by cshalizi
"The active field of Functional Data Analysis (about understanding the variation in a set of curves) has been recently extended to Object Oriented Data Analysis, which considers populations of more general objects. A particularly challenging extension of this set of ideas is to populations of tree-structured objects. We develop an analog of Principal Component Analysis for trees, based on the notion of tree-lines, and propose numerically fast (linear time) algorithms to solve the resulting optimization problems. The solutions we obtain are used in the analysis of a data set of 73 individuals, where each data object is a tree of blood vessels in one person's brain."
to:NB
data_analysis
principal_components
structured_data
Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
february 2012 by cshalizi
"We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic.
"We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is $O(n^2)$ per trial, where $n$ is the dimension of the instances."
to:NB
online_learning
dimension_reduction
machine_learning
learning_theory
warmuth.manfred
principal_components
low-regret_learning
PLoS ONE: Low Pitched Voices Are Perceived as Masculine and Attractive but Do They Predict Semen Quality in Men?
december 2011 by cshalizi
How does anyone _not_ read this paper and think that they were correlating everything they could until they got a "significant" effect?
--- I am very tempted right now to make this a problem set in ADA, but that's just asking for trouble, yes?
practices_relating_to_the_transmission_of_genetic_information
regression
statistics
bad_data_analysis
via:unfogged
have_read
principal_components
to:blog
[1111.6201] Learning a Factor Model via Regularized PCA
december 2011 by cshalizi
"We consider the problem of learning a linear factor model with an unknown number of factors. We propose a regularized form of principal component analysis (PCA) and demonstrate through experiments with synthetic and real data the superiority of resulting estimates to those produced by pre-existing factor analysis approaches. We also establish theoretical results that elucidate the manner in which our algorithm corrects biases induced by conventional PCA. An important feature of our algorithm is its computational efficiency, which is close to that of PCA, which enjoys wide use in large part due to its efficiency."
to:NB
factor_analysis
principal_components
statistics
have_read
to_teach:undergrad-ADA
van_roy.benjamin
Population Value Decomposition, a Framework for the Analysis of Image Populations - Journal of the American Statistical Association - 106(495):775
october 2011 by cshalizi
"Images, often stored in multidimensional arrays, are fast becoming ubiquitous in medical and public health research. Analyzing populations of images is a statistical problem that raises a host of daunting challenges. The most significant challenge is the massive size of the datasets incorporating images recorded for hundreds or thousands of subjects at multiple visits. We introduce the population value decomposition (PVD), a general method for simultaneous dimensionality reduction of large populations of massive images. We show how PVD can be seamlessly incorporated into statistical modeling, leading to a new, transparent, and rapid inferential framework. Our PVD methodology was motivated by and applied to the Sleep Heart Health Study, the largest community-based cohort study of sleep containing more than 85 billion observations on thousands of subjects at two visits. This article has supplementary material online." --- Presumably just some form of SVD for higher-dimensional arrays.
to:NB
principal_components
data_analysis
to_read
to_teach:data-mining
to_teach:undergrad-ADA
Practical Approaches to Principal Component Analysis in the Presence of Missing Values
august 2010 by cshalizi
From a quick skim, it looks too advanced to actually teach in 350, but potentially a handy reference.
principal_components
dimension_reduction
to_teach:data-mining
statistics
data_mining
to_teach:undergrad-ADA
Morris L. Eaton, Multivariate Statistics: A Vector Space Approach (Beachwood, Ohio, USA: Institute of Mathematical Statistics, 2007)
february 2010 by cshalizi
"The purpose of this book is to present a version of multivariate statistical theory in which vector space and invariance methods replace, to a large extent, more traditional multivariate methods. The book is a text. Over the past ten years, various versions have been used for graduate multivariate courses at the University of Chicago, the University of Copenhagen, and the University of Minnesota. Designed for a one year lecture course or for independent study, the book contains a full complement of problems and problem solutions."
books:noted
statistics
principal_components
regression
Holmes: Multivariate data analysis: The French way
december 2009 by cshalizi
"This paper presents exploratory techniques for multivariate data, many of them well known to French statisticians and ecologists, but few well understood in North American culture. We present the general framework of duality diagrams which encompasses discriminant analysis, correspondence analysis and principal components, and we show how this framework can be generalized to the regression of graphs on covariates." --- Having now read this, I think I can safely say that only in the land of Bourbaki would anyone think that conventional linear data analysis made more sense if one gave up talking about probability as useless, and focused all the attention on commutative diagrams.
regression
principal_components
data_analysis
linear_algebra
have_read
abstract_algebra
[0908.3400] Decomposing data sets into skewness modes
august 2009 by cshalizi
"We derive the nonlinear equations satisfied by the coefficients of linear combinations that maximize their skewness when their variance is constrained to take a specific value. In order to numerically solve these nonlinear equations we develop a gradient-type flow that preserves the constraint. In combination with the Karhunen-Loève decomposition this leads to a set of orthogonal modes with maximal skewness. For illustration purposes we apply these techniques to atmospheric data; in this case the maximal-skewness modes correspond to strongly localized atmospheric flows. We show how these ideas can be extended, for example to maximal-flatness modes."
dimension_reduction
data_analysis
principal_components
karhunen-loeve_decomposition
statistics
"Statistical Theory and Methods for Complex, High-Dimensional Data"
june 2009 by cshalizi
Loads of talks.
statistics
machine_learning
model_selection
graphical_models
regression
latent_variables
principal_components
factor_analysis
dimension_reduction
lasso
bioinformatics
track_down_references
via:shivak
Invariant co-ordinate selection
june 2009 by cshalizi
"A general method for exploring multivariate data by comparing different estimates of multivariate scatter is presented. The method is based on the eigenvalue–eigenvector decomposition of one scatter matrix relative to another. In particular, it is shown that the eigenvectors can be used to generate an affine invariant co-ordinate system for the multivariate data. Consequently, we view this method as a method for invariant co-ordinate selection."
statistics
data_analysis
visual_display_of_quantitative_information
principal_components
A Meta-Analysis of Variance Accounted for and Factor Loadings in Exploratory Factor Analysis
may 2009 by cshalizi
Shorter Peterson: Your results look like a factor analysis of pure noise. Have a nice day. (Also, a citation in support of the folk wisdom that factor analysis doesn't work any better as data reduction than simple principal components analysis.)
factor_analysis
statistics
to:NB
to_teach:data-mining
via:moritz-heene
re:g_paper
dimension_reduction
principal_components
to_teach:undergrad-ADA
Applying Discrete PCA in Data Analysis
may 2009 by cshalizi
I heard Aleks talk about this at UAI 2004... and then forgot about it completely when I taught data mining. My bad.
to_teach:data-mining
principal_components
independent_components_analysis
statistics
latent_variables
latent_semantic_analysis
to:NB
jakulin.aleks
buntine.wray
The Screens Issue - If You Liked This, Sure to Love That - Winning the Netflix Prize - NYTimes.com
november 2008 by cshalizi
What the ******* ****, Netflix wasn't using singular value decomposition? Can that really be true? (The hope that the report massively misunderstood is the only thing saving this from an "utter_stupidity" tag.)
netflix_prize
data_mining
collaborative_filtering
to_teach:data-mining
principal_components
Novembre and Stephens, "Interpreting principal component analyses of spatial population genetic variation" (Nature Genetics)
may 2008 by cshalizi
"We find that gradients and waves observed in ... maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events."
genetics
human_genetics
statistics
principal_components
spatial_statistics
stepping_stone_model
cavalli-sforza
via:arthegall
bad_data_analysis
to_teach:data-mining
to:NB
to_teach:undergrad-ADA
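The artifact described in the Novembre and Stephens abstract can be reproduced without any data at all. A sketch, assuming a one-dimensional line of sites whose covariance decays with distance as ρ^|i−j| (an illustrative choice, not the model in the paper): the leading eigenvectors of that covariance matrix are wave-like, with successive components picking up one more sign change each. The function names and parameter values are hypothetical, and the eigenvectors are taken from the covariance matrix itself rather than from simulated samples.

```python
import math
import random


def leading_eigenvectors(mat, k, iters=400, seed=0):
    """Top-k eigenvectors of a symmetric PSD matrix: power iteration plus deflation."""
    n = len(mat)
    work = [row[:] for row in mat]  # deflated in place, so copy first
    rng = random.Random(seed)
    vectors = []
    for _component in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _step in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector.
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(n)) for i in range(n))
        vectors.append(v)
        # Deflate: subtract the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return vectors


def sign_changes(v):
    """Number of times consecutive entries of v switch sign."""
    signs = [x > 0 for x in v]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Assumed setup: 30 sites on a line, correlation 0.9 between neighbours.
n, rho = 30, 0.9
cov = [[rho ** abs(i - j) for j in range(n)] for i in range(n)]
for rank, pc in enumerate(leading_eigenvectors(cov, 3), start=1):
    print(f"PC{rank}: {sign_changes(pc)} sign change(s) along the line")
```

No migration or any other event is built into this covariance, yet the components still come out as a gradient, then a single wave, then a double wave.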