cshalizi + two-sample_tests   6

A Kernel Two-Sample Test
"We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests."
in_NB  to_read  hilbert_space  kernel_methods  goodness-of-fit  statistics  concentration_of_measure  probability  two-sample_tests  re:network_differences 
7 weeks ago by cshalizi
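The MMD statistic described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it uses the standard unbiased quadratic-time estimate of the squared MMD with a Gaussian kernel, and it calibrates the test by permutation, which is a swapped-in technique (the abstract's own tests use large deviation bounds and the asymptotic distribution). The function names, the `bandwidth` default, and the permutation count are all assumptions made for the example.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Unbiased, quadratic-time estimate of the squared MMD between two samples.

    Being unbiased, the estimate can come out slightly negative.
    """
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample needs at least two points")
    # Within-sample kernel sums, excluding the diagonal terms k(x_i, x_i).
    k_xx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    # Between-sample kernel sum over all pairs.
    k_xy = sum(kernel(x, y) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


def permutation_pvalue(xs, ys, n_perm=200, seed=0):
    """Permutation p-value for the MMD statistic (assumed calibration, not the paper's)."""
    rng = random.Random(seed)
    observed = mmd2_unbiased(xs, ys)
    pooled = list(xs) + list(ys)
    m = len(xs)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled sample
        if mmd2_unbiased(pooled[:m], pooled[m:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Each call to `mmd2_unbiased` costs on the order of (m + n)^2 kernel evaluations, which is the quadratic time the abstract mentions. The permutation loop multiplies that by `n_perm`, so this sketch only suits small samples.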
Henze : A Multivariate Two-Sample Test Based on the Number of Nearest Neighbor Type Coincidences
"For independent $d$-variate random samples $X_1, \cdots, X_{n_1}$ i.i.d. $f(x), Y_1, \cdots, Y_{n_2}$ i.i.d. $g(x)$, where the densities $f$ and $g$ are assumed to be continuous a.e., consider the number $T$ of all $k$ nearest neighbor comparisons in which observations and their neighbors belong to the same sample. We show that, if $f = g$ a.e., the limiting (normal) distribution of $T$, as $\min(n_1, n_2) \rightarrow \infty, n_1/(n_1 + n_2) \rightarrow \tau, 0 < \tau < 1$, does not depend on $f$. An omnibus procedure for testing the hypothesis $H_0: f = g$ a.e. is obtained by rejecting $H_0$ for large values of $T$. The result applies to a general distance (generated by a norm on $\mathbb{R}^d$) for determining nearest neighbors, and it generalizes to the multisample situation."
to:NB  to_read  statistics  hypothesis_testing  two-sample_tests  re:AoS_project 
february 2012 by cshalizi
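The count $T$ in the abstract above is easy to compute by brute force. The sketch below pools the two samples, finds each point's $k$ nearest neighbors under Euclidean distance, and counts the neighbor pairs that share a sample label. It returns the raw count only. The centering and scaling needed to use the limiting normal distribution are not reproduced here. The function name `nn_coincidences` and the tie-breaking by index are assumptions for illustration (ties have probability zero under the abstract's continuity assumption).

```python
import math


def nn_coincidences(xs, ys, k=1):
    """Count k-nearest-neighbor pairs in the pooled sample that share a sample label."""
    pooled = [(tuple(p), 0) for p in xs] + [(tuple(p), 1) for p in ys]
    if k < 1 or k >= len(pooled):
        raise ValueError("k must be between 1 and the pooled sample size minus 1")
    total = 0
    for i, (point, label) in enumerate(pooled):
        # Distances from this point to every other pooled point, nearest first.
        others = sorted(
            (math.dist(point, other), j)
            for j, (other, _) in enumerate(pooled)
            if j != i
        )
        # A coincidence is a neighbor drawn from the same sample as the point.
        total += sum(1 for _, j in others[:k] if pooled[j][1] == label)
    return total


# Well-separated samples: every nearest neighbor is in the same sample.
separated = nn_coincidences([(0.0,), (0.1,), (0.2,)], [(10.0,), (10.1,), (10.2,)])
# Perfectly interleaved samples: every nearest neighbor is in the other sample.
interleaved = nn_coincidences([(0.0,), (2.0,), (4.0,)], [(1.0,), (3.0,), (5.0,)])
```

Here `separated` is 6 (the largest possible value with six points and k = 1) and `interleaved` is 0, matching the abstract's rule of rejecting $H_0$ for large $T$.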
[1202.1561] Tree Models for Difference and Change Detection in a Complex Environment
"A new family of tree models is proposed, which we call "differential trees." A differential tree model is constructed from multiple data sets and aims to detect distributional differences between them. The new methodology differs from the existing difference and change detection techniques in its nonparametric nature, model construction from multiple data sets, and applicability to high-dimensional data. Through a detailed study of an arson case in New Zealand, where an individual is known to have been laying vegetation fires within a certain time period, we illustrate how these models can help detect changes in the frequencies of event occurrences and uncover unusual clusters of events in a complex environment."

--- After reading, I think their exposition is needlessly hard to follow, but let me take a stab at it. In an ordinary classification tree, we are interested in the distribution of the class labels Y given the predictors X, i.e., Pr(Y|X), and make splits on X so that (in essence) the conditional entropy H[Y|X] becomes small. This is of course equivalent to making splits so that the divergence of Pr(Y|X) from Pr(Y) is maximized. What they are interested in is not classification but _describing_ how the different classes are distinct, so the relevant distribution is Pr(X|Y), and they want a big divergence between Pr(X) and Pr(X|Y).
to:NB  re:network_differences  statistics  hypothesis_testing  density_estimation  decision_trees  have_read  data_mining  two-sample_tests 
february 2012 by cshalizi
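The equivalence claimed in the note above can be checked numerically on a small discrete example. The sketch assumes "divergence" means Kullback-Leibler divergence averaged over the conditioning variable, which the note does not say explicitly. Under that reading, H[Y] - H[Y|X], the expected divergence of Pr(Y|X) from Pr(Y), and the expected divergence of Pr(X|Y) from Pr(X) are all the same number, the mutual information between X and Y. The function names and the toy joint table are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def information_summaries(joint):
    """Given joint[x][y] = Pr(X=x, Y=y) with positive marginals, return three quantities.

    Returns (H[Y] - H[Y|X], E_X[D(Pr(Y|X) || Pr(Y))], E_Y[D(Pr(X|Y) || Pr(X))]).
    """
    nx, ny = len(joint), len(joint[0])
    px = [sum(row) for row in joint]
    py = [sum(joint[i][j] for i in range(nx)) for j in range(ny)]
    y_given_x = [[joint[i][j] / px[i] for j in range(ny)] for i in range(nx)]
    x_given_y = [[joint[i][j] / py[j] for i in range(nx)] for j in range(ny)]
    entropy_drop = entropy(py) - sum(px[i] * entropy(y_given_x[i]) for i in range(nx))
    div_y = sum(px[i] * kl(y_given_x[i], py) for i in range(nx))
    div_x = sum(py[j] * kl(x_given_y[j], px) for j in range(ny))
    return entropy_drop, div_y, div_x


# Toy joint distribution over a binary split X and a binary class label Y.
drop, div_y, div_x = information_summaries([[0.4, 0.1], [0.1, 0.4]])
```

Since H[Y] does not depend on the split, making H[Y|X] small is the same as making `drop` (and hence `div_y`) large. That `div_x` agrees as well is what connects the classification view, Pr(Y|X), to the descriptive view, Pr(X|Y), at least in this averaged sense.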
f-Divergence Estimation and Two-Sample Homogeneity Test Under Semiparametric Density-Ratio Models
"A density ratio is defined by the ratio of two probability densities. We study the inference problem of density ratios and apply a semiparametric density-ratio estimator to the two-sample homogeneity test. In the proposed test procedure, the $f$-divergence between two probability densities is estimated using a density-ratio estimator. The $f$-divergence estimator is then exploited for the two-sample homogeneity test. We derive an optimal estimator of $f$-divergence in the sense of the asymptotic variance in a semiparametric setting, and provide a statistic for two-sample homogeneity test based on the optimal estimator. We prove that the proposed test dominates the existing empirical likelihood score test. Through numerical studies, we illustrate the adequacy of the asymptotic theory for finite-sample inference."
to:NB  statistics  density_estimation  information_theory  hypothesis_testing  two-sample_tests 
february 2012 by cshalizi
Nonparametric Tests for Homogeneity Based on Non-Bipartite Matching
"Given a sequence of observations, has a change occurred in the underlying probability distribution with respect to observation order? This problem of detecting change points arises in a variety of applications including health prognostics for mechanical systems, syndromic disease surveillance in geographically dispersed populations, anomaly detection in information networks, and multivariate process control in general. Detecting change points in high-dimensional settings is challenging, and most change-point methods for multidimensional problems rely upon distributional assumptions or the use of observation history to model probability distributions. We present three new nonparametric statistical tests for heterogeneity based on the combinatorial properties of minimum non-bipartite matching (MNBM). The key idea underlying each of these tests is that if a sequence of independent random observations undergoes a change in distribution—either an abrupt “shift” or a gradual “drift”—a MNBM based on inter-point distances tends to produce pairings that are closer in the sequence labeling than would be the case if the observations were drawn from the same distribution. Our tests follow on the work of Rosenbaum (2005) who used MNBM to derive a simple cross-match test statistic for the two-sample problem based on this idea. Similar ideas are present in the minimum spanning tree (MST) test derived by Friedman and Rafsky (1979, 1981). We extend these approaches by utilizing ensembles of orthogonal MNBMs which greatly increase information extraction from the data, leading to tests that compare favorably to parametric procedures while maintaining level and good power properties across distributions."
to:NB  statistics  hypothesis_testing  density_estimation  change-point_problem  two-sample_tests 
january 2012 by cshalizi
