Vaguery + statistics 175
Attractive Models - Kieran Healy
27 days ago by Vaguery
"Now, if you write a paper describing negative results—a model where nothing is significant—then you may have a hard time getting it published. In the absence of some specific controversy, negative results are boring. For the same reason, though, if your results just barely cross the threshold of conventional significance, they may stand a disproportionately better chance of getting published than an otherwise quite similar paper where the results just failed to make the threshold. And this is what the graph above shows, for papers published in the American Political Science Review. It’s a histogram of p-values for coefficients in regressions reported in the journal. The dashed line is the conventional threshold for significance. The tall red bar to the right of the dashed line is the number of coefficients that just made it over the threshold, while the short red bar is the number of coefficients that just failed to do so. If there were no bias in the publication process, the shape of the histogram would approximate the right-hand side of a bell curve. The gap between the big and the small red bars is a consequence of two things: the unwillingness of journals to report negative results, and the efforts of authors to search for (and write up) results that cross the conventional threshold."
statistics
academic-culture
publishing
meta-analysis
27 days ago by Vaguery
No, physicians don’t understand screening statistics | The Incidental Economist
4 weeks ago by Vaguery
"So basically, when it comes to saving lives, docs are three times more likely to recommend a screening test based on irrelevant data than they are to recommend it based on relevant data. I’m bracing myself for the hate mail, but this is part of the reason why I’m skeptical that just providing docs with more evidence will change the way they practice. Most docs just aren’t trained to understand this stuff."
medical-culture
healthcare
statistics
probability-theory
planning
4 weeks ago by Vaguery
An algorithm is just an algorithm | Gene Expression | Discover Magazine
4 weeks ago by Vaguery
"Another illustration that knowledge comes not through blind adherence to methods, but human reflection."
algorithms
statistics
storytelling
i-need-the-name-for-this
4 weeks ago by Vaguery
[1203.3353] Solving Structure with Sparse, Randomly-Oriented X-ray Data
9 weeks ago by Vaguery
"Single-particle imaging experiments of biomolecules at x-ray free-electron lasers (XFELs) require processing of hundreds of thousands (or more) of images that contain very few x-rays. Each low-flux image of the diffraction pattern is produced by a single, randomly oriented particle, such as a protein. We demonstrate the feasibility of collecting data at these extremes, averaging only 2.5 photons per frame, where it seems doubtful there could be information about the state of rotation, let alone the image contrast. This is accomplished with an expectation maximization algorithm that processes the low-flux data in aggregate, and without any prior knowledge of the object or its orientation. The versatility of the method promises, more generally, to redefine what measurement scenarios can provide useful signal in the high-noise regime."
structural-biology
image-analysis
crystallography
algorithms
inverse-problems
nudge-targets
statistics
9 weeks ago by Vaguery
[1203.3284] Efficient Enumeration of the Directed Binary Perfect Phylogenies from Incomplete Data
9 weeks ago by Vaguery
"We study a character-based phylogeny reconstruction problem when an incomplete set of data is given. More specifically, we consider the situation under the directed perfect phylogeny assumption with binary characters in which for some species the states of some characters are missing. Our main object is to give an efficient algorithm to enumerate (or list) all perfect phylogenies that can be obtained when the missing entries are completed. While a simple branch-and-bound algorithm (B&B) shows a theoretically good performance, we propose another approach based on a zero-suppressed binary decision diagram (ZDD). Experimental results on randomly generated data exhibit that the ZDD approach outperforms B&B. We also prove that counting the number of phylogenetic trees consistent with a given data is #P-complete, thus providing an evidence that an efficient random sampling seems hard."
phylogenetics
inverse-problems
genetics
algorithms
statistics
nudge-targets
9 weeks ago by Vaguery
[1203.1975] Warped Functional Regression
10 weeks ago by Vaguery
"A characteristic feature of functional data is the presence of time variability in addition to amplitude variability. The existing functional regression methods do not handle time variability in an explicit and efficient way. In this paper we introduce a functional regression method that incorporates time warping as an intrinsic part of the model. The method achieves good predictive power in a parsimonious way, and allows for unified statistical inference of time and amplitude variability. The properties of the estimators are studied by simulation, and an application to the modeling of ground-level ozone trajectories is presented."
statistics
time-series
modeling
algorithms
10 weeks ago by Vaguery
[1203.1065] Subspace clustering of high-dimensional data: a predictive approach
10 weeks ago by Vaguery
"In several application domains, high-dimensional observations are collected and then analysed in search for naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC) partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets for which PSC often provides state-of-art results."
ain't-performance-space
statistics
clustering
cure-for-dimensionality
algorithms
10 weeks ago by Vaguery
[1111.3304] Eigenvector Synchronization, Graph Rigidity and the Molecule Problem
11 weeks ago by Vaguery
"The graph realization problem has received a great deal of attention in recent years, due to its importance in applications such as wireless sensor networks and structural biology.…"
algorithms
statistics
structure
learning-from-data
nudge-targets
11 weeks ago by Vaguery
[1003.5956] Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms
11 weeks ago by Vaguery
"…In this paper, we introduce a replay methodology for contextual bandit algorithm evaluation. Different from simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show accuracy and effectiveness of our offline evaluation method."
classification
recommendations
algorithms
machine-learning
crowdsourcing
nudge-targets
statistics
11 weeks ago by Vaguery
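The replay idea quoted in the entry above can be illustrated with a minimal sketch. It assumes the log was collected by showing arms uniformly at random, and it evaluates a fixed (non-learning) policy; the names `replay_evaluate`, `Event`, and `always_arm_zero` are hypothetical and chosen only for illustration, not taken from the paper.

```python
from typing import Callable, Iterable, Tuple

# One logged event: (context, arm the logging policy showed, observed reward).
# Assumption: the logging policy picked arms uniformly at random.
Event = Tuple[dict, int, float]


def replay_evaluate(policy: Callable[[dict], int],
                    log: Iterable[Event]) -> Tuple[float, int]:
    """Estimate a policy's average reward from logged data.

    Only events where the policy would have chosen the logged arm are kept;
    all other events are skipped because their reward is unobserved.
    Returns (average reward over matched events, number of matched events).
    """
    total, matched = 0.0, 0
    for context, logged_arm, reward in log:
        if policy(context) == logged_arm:
            total += reward
            matched += 1
    average = total / matched if matched else float("nan")
    return average, matched


def always_arm_zero(context: dict) -> int:
    """A trivial hypothetical policy that ignores the context."""
    return 0


if __name__ == "__main__":
    # A tiny hand-made log, purely illustrative.
    toy_log = [({}, 0, 1.0), ({}, 1, 0.0), ({}, 0, 0.0), ({}, 1, 1.0)]
    print(replay_evaluate(always_arm_zero, toy_log))
```

Because mismatched events are discarded, the estimate rests on only a fraction of the log, so a large log is needed when there are many arms.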
Visualization series: Insight from Cleveland and Tufte on plotting numeric data by groups | Solomon Messing
11 weeks ago by Vaguery
"A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below). Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points. Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive."
visualization
data-analysis
communication
graphic-design
argumentation
statistics
ggplot2
11 weeks ago by Vaguery
[1112.6235] Detecting a Vector Based on Linear Measurements
january 2012 by Vaguery
We consider a situation where the state of a system is represented by a real-valued vector. Under normal circumstances, the vector is zero, while an event manifests as non-zero entries in this vector, possibly few. Our interest is in the design of algorithms that can reliably detect events (i.e., test whether the vector is zero or not) with the least amount of information. We place ourselves in a situation, now common in the signal processing literature, where information about the vector comes in the form of noisy linear measurements. We derive information bounds in an active learning setup and exhibit some simple near-optimal algorithms. In particular, our results show that the task of detection within this setting is at once much easier, simpler and different than the tasks of estimation and support recovery.
signal-processing
statistics
algorithms
nudge-targets
january 2012 by Vaguery
[1109.2215] Finding missing edges and communities in incomplete networks
january 2012 by Vaguery
Many algorithms have been proposed for predicting missing edges in networks, but they do not usually take account of which edges are missing. We focus on networks which have missing edges of the form that is likely to occur in real networks, and compare algorithms that find these missing edges. We also investigate the effect of this kind of missing data on community detection algorithms.
network-theory
algorithms
inference
statistics
nudge-targets
january 2012 by Vaguery
[1010.4735] Exploring the Energy Landscapes of Protein Folding Simulations with Bayesian Computation
january 2012 by Vaguery
Nested sampling is a Bayesian sampling technique developed to explore probability distributions localised in an exponentially small area of the parameter space. The algorithm provides both posterior samples and an estimate of the evidence (marginal likelihood) of the model. The nested sampling algorithm also provides an efficient way to calculate free energies and the expectation value of thermodynamic observables at any temperature, through a simple post-processing of the output. Previous applications of the algorithm have yielded large efficiency gains over other sampling techniques, including parallel tempering (replica exchange). In this paper we describe a parallel implementation of the nested sampling algorithm and its application to the problem of protein folding in a Go-type force field of empirical potentials that were designed to stabilize secondary structure elements in room-temperature simulations. We demonstrate the method by conducting folding simulations on a number of small proteins which are commonly used for testing protein folding procedures: protein G, the SH3 domain of Src tyrosine kinase and chymotrypsin inhibitor 2. A topological analysis of the posterior samples is performed to produce energy landscape charts, which give a high level description of the potential energy surface for the protein folding simulations. These charts provide qualitative insights into both the folding process and the nature of the model and force field used.
structural-biology
biochemistry
modeling
algorithms
statistics
metamodeling
january 2012 by Vaguery
[1109.3248] Reconstruction of sequential data with density models
january 2012 by Vaguery
We introduce the problem of reconstructing a sequence of multidimensional real vectors where some of the data are missing. This problem contains regression and mapping inversion as particular cases where the pattern of missing data is independent of the sequence index. The problem is hard because it involves possibly multivalued mappings at each vector in the sequence, where the missing variables can take more than one value given the present variables; and the set of missing variables can vary from one vector to the next. To solve this problem, we propose an algorithm based on two redundancy assumptions: vector redundancy (the data live in a low-dimensional manifold), so that the present variables constrain the missing ones; and sequence redundancy (e.g. continuity), so that consecutive vectors constrain each other. We capture the low-dimensional nature of the data in a probabilistic way with a joint density model, here the generative topographic mapping, which results in a Gaussian mixture. Candidate reconstructions at each vector are obtained as all the modes of the conditional distribution of missing variables given present variables. The reconstructed sequence is obtained by minimising a global constraint, here the sequence length, by dynamic programming. We present experimental results for a toy problem and for inverse kinematics of a robot arm.
inverse-problems
statistics
algorithms
learning-from-data
nudge-targets
january 2012 by Vaguery
[1112.6178] A general framework for online audio source separation
january 2012 by Vaguery
We consider the problem of online audio source separation. Existing algorithms adopt either a sliding block approach or a stochastic gradient approach, which is faster but less accurate. Also, they rely either on spatial cues or on spectral cues and cannot separate certain mixtures. In this paper, we design a general online audio source separation framework that combines both approaches and both types of cues. The model parameters are estimated in the Maximum Likelihood (ML) sense using a Generalised Expectation Maximisation (GEM) algorithm with multiplicative updates. The separation performance is evaluated as a function of the block size and the step size and compared to that of an offline algorithm.
signal-processing
audio-segmentation
statistics
algorithms
metaheuristics
nudge-targets
january 2012 by Vaguery
[1112.0826] Clustering under Perturbation Resilience
january 2012 by Vaguery
Recently, Bilu and Linial formalized an implicit assumption often made when choosing a clustering objective: that the optimum clustering to the objective should be preserved under small multiplicative perturbations to distances between points. They showed that for max-cut clustering it is possible to circumvent NP-hardness and obtain polynomial-time algorithms for instances resilient to large (factor $O(\sqrt{n})$) perturbations, and subsequently Awasthi et al. considered center-based objectives, giving algorithms for instances resilient to O(1) factor perturbations.
In this paper, we greatly advance this line of work. For center-based objectives, we present an algorithm that can optimally cluster instances resilient to $(1 + \sqrt{2})$-factor perturbations, solving an open problem of Awasthi et al. For a commonly used center-based objective $k$-median, we additionally give algorithms for a more relaxed assumption in which we allow the optimal solution to change in a small $\epsilon$ fraction of the points after perturbation. We give the first bounds known for this more realistic and more general setting. We also provide positive results for min-sum clustering which is a generally much harder objective than $k$-median (and also non-center-based). Our algorithms are based on new linkage criteria that may be of independent interest.
Additionally, we give sublinear-time algorithms, showing algorithms that can return an implicit clustering from only access to a small random sample.
clustering
statistics
nonparametric-methods
robustness
resilience
algorithms
nudge-targets
january 2012 by Vaguery
[1112.5794] BATMAN-an R package for the automated quantification of metabolites from NMR spectra using a Bayesian Model
january 2012 by Vaguery
Motivation: NMR spectra are widely used in metabolomics to obtain metabolite profiles in complex biological mixtures. Common methods used to assign and estimate concentrations of metabolite involve either an expert manual peak fitting or extra pre-processing steps, such as peak alignment and binning. Peak fitting is very time consuming and is subject to human error. Conversely, alignment and binning can introduce artifacts and limit immediate biological interpretation of models. Results: We present the Bayesian AuTomated Metabolite Analyser for NMR spectra (BATMAN), an R package which deconvolves peaks from 1-dimensional NMR spectra, automatically assigns them to specific metabolites and obtains concentration estimates. The Bayesian model incorporates information on characteristic peak patterns of metabolites and is able to account for shifts in the position of peaks commonly seen in NMR spectra of biological samples. It applies a Markov Chain Monte Carlo (MCMC) algorithm to sample from a joint posterior distribution of the model parameters and obtains concentration estimates with reduced mean estimation error compared with conventional numerical integration methods.
learning-from-data
statistics
modeling
biochemistry
nudge-targets
image-segmentation
january 2012 by Vaguery
[1109.5664] Deterministic Feature Selection for $k$-means Clustering
december 2011 by Vaguery
"We study feature selection for $k$-means clustering. Although the literature contains many methods with good empirical performance, algorithms with provable theoretical behavior have only recently been developed. Unfortunately, these algorithms are randomized and fail with, say, a constant probability. We address this issue by presenting a \emph{deterministic} feature selection algorithm for $k$-means with theoretical guarantees. At the heart of our algorithm lies a deterministic method for decompositions of the identity."
clustering
statistics
algorithms
nudge-targets
december 2011 by Vaguery
[1107.2379] Data Stability in Clustering: A Closer Look
december 2011 by Vaguery
"This paper considers the model introduced by Bilu and Linial (2010), who study problems for which the optimal clustering does not change when the distances are perturbed by multiplicative factors. They show that even when a problem is NP-hard, it is sometimes possible to obtain polynomial-time algorithms for instances resilient to large perturbations, e.g. on the order of $O(\sqrt{n})$ for max-cut clustering. Awasthi et al. (2010) extend this line of work by considering center-based objectives, and Balcan and Liang (2011) consider the $k$-median and min-sum objectives, giving efficient algorithms for instances resilient to certain constant multiplicative perturbations.
Here, we are motivated by the question of to what extent these assumptions can be relaxed while allowing for efficient algorithms. We show there is little room to improve these results by giving NP-hardness lower bounds for both the $k$-median and min-sum objectives. On the other hand, we show that multiplicative resilience parameters, even only on the order of $\Theta(1)$, can be so strong as to make the clustering problem trivial, and we exploit these assumptions to present a simple one pass streaming algorithm for the $k$-median objective. We also consider a model of additive perturbations and give a correspondence between additive and multiplicative notions of stability. Our results provide a close examination of the consequences of assuming, even constant, stability in data."
clustering
statistics
algorithms
robustness
nudge-targets
december 2011 by Vaguery
[1110.0463] A binary noisy channel to model errors in printing process
november 2011 by Vaguery
To model printing noise a binary noisy channel and a set of controlled gates are introduced. The channel input is an image created by a halftoning algorithm and its output is the printed picture. Using this channel robustness to noise between halftoning algorithms can be studied. We introduced relative entropy to describe immunity of the algorithm to noise and tested several halftoning algorithms.
printing
modeling
inverse-problems
simulation
statistics
nudge-targets
november 2011 by Vaguery
[1110.1462] Dynamic Clustering of Histogram Data Based on Adaptive Squared Wasserstein Distances
october 2011 by Vaguery
"…To cluster sets of histogram data, we propose to use Dynamic Clustering Algorithm, (based on adaptive squared Wasserstein distances) that is a k-means-like algorithm for clustering a set of individuals into K classes that are apriori fixed. The main aim of this research is to provide a tool for clustering histograms, emphasizing the different contributions of the histogram variables, and their components, to the definition of the clusters. We demonstrate that this can be achieved using adaptive distances.
Two kinds of adaptive distances are considered: the first takes into account the variability of each component of each descriptor for the whole set of individuals; the second takes into account the variability of each component of each descriptor in each cluster. We furnish interpretative tools of the obtained partition based on an extension of the classical measures (indexes) to the use of adaptive distances in the clustering criterion function. Applications on synthetic and real-world data corroborate the proposed procedure."
classification
statistics
histograms
metrics
clustering
october 2011 by Vaguery
[1110.0725] A Survey of Distributed Data Aggregation Algorithms
october 2011 by Vaguery
"Distributed data aggregation has been an active field of research in the last decade, and a huge and diverse range of techniques can be found in the literature. For this reason, this survey intends to be an important time saving instrument for those that desire to get a quick and comprehensive overview of the state of the art on distributed data aggregation. Moreover, by carefully highlighting the strengths and limitations of the more pertinent approaches, this study can provide useful assistance to help readers choose which technique to apply in specific settings.
Currently, there is no ideal general solution to the distributed computation of an aggregation function; all existing techniques have their pitfalls (some more than others). Therefore, more research in this field will be expected in the next few years. In particular, due to the added value of computing complex aggregates, new algorithms might arise to estimate the statistical distribution of values, as the few existing approaches exhibit some limitations in terms of accuracy and resource consumption. Additional research efforts should be made to improve the support to churn, message loss, and continuous estimation of mutable input values."
statistics
reviews
distributed-processing
communication
coordination
nudge-targets
october 2011 by Vaguery
Even Tiny Bouts of Exercise are Associated with Increased Fitness | Obesity Panacea
june 2011 by Vaguery
"These results are encouraging and suggest that random, short duration physical activity, which may be more feasible and enjoyable for inactive individuals attempting to engage in physical activity for health benefit, is indeed beneficial."
exercise
healthcare
fitness
statistics
june 2011 by Vaguery
Weighty Matters: Is sodium a dietary red herring for the effects of processed foods?
june 2011 by Vaguery
"I think there's at least one more possibility:
3. Sodium isn't a causal agent of disease but instead, given that processed foods are phenomenally high in sodium, is a useful biomarker for the degree of processed foods a person's consuming, and that it's the huge volumes of sugar and pulverized flour (that's more often than not packaged with gobs of sodium) that's actually causal for cardiovascular disease and death."
healthcare
statistics
medical-culture
consumerism
fast-food
june 2011 by Vaguery
Doctors are human | The Incidental Economist
june 2011 by Vaguery
"…But this is America. If you want to have the procedure, so be it. You get to choose. That’s the way we roll.
My question is, did your doctor recommend it? Did your doctor tell you about this study? Do you think that those who recommend and perform this procedure don’t know about this study, and that if only they had this evidence they’d stop?
Or, do you think physicians are influenced by biases and their personal beliefs? Me? I think they’re human."
medical-culture
statistics
healthcare
marketing
cognitive-psychology
evidence-based
june 2011 by Vaguery
The distribution of interestingness | (R news & tutorials)
may 2011 by Vaguery
"The longer – and far less satisfying – answer to the question of how interestingness measures should be distributed is, “it depends,” as the following discussion illustrates."
statistics
interestingness
design-of-measures
statisticians-don't-do-Pragmatism-well
learning-from-data
may 2011 by Vaguery
Growing need for data heads
may 2011 by Vaguery
"I've said it before, but if digging into data is your idea of fun, there's a whole mess of excitement and adventure headed your way. There are lots of opportunities already out there in marketing, journalism, tech, the Web, government, and pretty much everywhere you look. And more importantly, there are lots of opportunities that you can make for yourself. This is a great time for data heads."
data-science
data-mining
statistics
jobs
advice
may 2011 by Vaguery
Friday fun projects | (R news & tutorials)
may 2011 by Vaguery
At some point, I’ll turn to my favourite web application combo: Sinatra + MongoDB + Highcharts, to visualize these data dynamically on a web page. For now though, we can get a quick idea and create even more Friday fun by learning how to use RApache to run and view R code in the browser.
Ruby
R-language
visualization
statistics
programming
learning-by-doing
may 2011 by Vaguery
ashleyw/phrasie - GitHub
may 2011 by Vaguery
Determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.
Ruby
library
tagging
natural-language-processing
NLP
statistics
text-mining
may 2011 by Vaguery
[1102.3220] A signal recovery algorithm for sparse matrix based compressed sensing
april 2011 by Vaguery
"Even when the numbers of non-zero entries per column/row in the measurement matrices are limited to $O(1)$, numerical experiments indicate that the algorithm can still typically recover the original signal perfectly with an $O(N)$ computational cost per update as well if the density $\rho$ of non-zero entries of the signal is lower than a certain critical value $\rho_{\rm th}(\alpha)$ as $N,M \to \infty$."
compressed-sensing
algorithms
signal-processing
nudge-targets
machine-learning
statistics
from delicious
april 2011 by Vaguery
[0807.1271] Semiparametric curve alignment and shift density estimation for biological data
august 2010 by Vaguery
"Assume that we observe a large number of curves, all of them with identical, although unknown, shape, but with a different random shift. The objective is to estimate the individual time shifts and their distribution. Such an objective appears in several biological applications like neuroscience or ECG signal processing, in which the estimation of the distribution of the elapsed time between repetitive pulses with a possibly low signal-noise ratio, and without a knowledge of the pulse shape is of interest. We suggest an M-estimator leading to a three-stage algorithm: we split our data set in blocks, on which the estimation of the shifts is done by minimizing a cost criterion based on a functional of the periodogram; the estimated shifts are then plugged into a standard density estimator. We show that under mild regularity assumptions the density estimate converges weakly to the true shift distribution. The theory is applied both to simulations and to alignment of real ECG signals.…"
data-analysis
statistics
algorithms
heuristics
exploratory-data-analysis
nudge
optimization
classification
time-series
august 2010 by Vaguery
[1008.1414] Statistically validated networks in bipartite complex systems
august 2010 by Vaguery
"Many complex systems present an intrinsic bipartite nature and are often described and modeled in terms of networks [1-5]. Examples include movies and actors [1, 2, 4], authors and scientific papers [6-9], email accounts and emails [10], plants and animals that pollinate them [11, 12]. Bipartite networks are often very heterogeneous in the number of relationships that the elements of one set establish with the elements of the other set. … Here we introduce an unsupervised method to statistically validate each link of the projected network against a null hypothesis taking into account the heterogeneity of the system. We apply our method to three different systems…. In all these systems, both different in size and level of heterogeneity, we find that our method is able to detect network structures which are informative about the system…"
complexology
network-theory
algorithms
machine-learning
nudge-targets
inference
statistics
august 2010 by Vaguery
[1008.1758] Stochastic Data Clustering
august 2010 by Vaguery
"In 1961 Herbert Simon and Albert Ando published the theory behind the long-term behavior of a dynamical system that can be described by a nearly completely decomposable matrix. Over the past fifty years this theory has been used in a variety of contexts, including queueing theory, computer performance, and ecology. In all these applications, the structure of the system is known and the point of interest is the various states the system passes through on its way to some long-term equilibrium. This paper looks at this problem from the other direction. That is, we develop a technique for using the evolution of the system to tell us about its initial structure, and we use this technique to develop a new algorithm for data clustering."
clustering
data-analysis
exploratory-data-analysis
statistics
algorithms
august 2010 by Vaguery
[1007.5516] Variable importance and model selection by decorrelation
august 2010 by Vaguery
"We introduce a simple criterion, the CAR score, for ranking and selecting variables in linear regression. The CAR score arises naturally in the best predictor formulation of the linear model, offers a canonical decomposition of the proportion of explained variance, and also takes account of correlation and grouping structure among explanatory variables. As population quantity the CAR score is not tied to any specific inference paradigm. Variable selection based on AIC, $C_p$, BIC, and other information criteria is shown to be equivalent to thresholding CAR scores at a fixed level, whereas using false discovery rates corresponds to an adaptive cutoff. In computer simulations we show that CAR scores are highly effective for variable selection with a prediction error that compares favorable with the elastic net and similar regression procedures. We illustrate the approach by analyzing diabetes data as well as gene expression data from the human frontal cortex."
statistics
variable-selection
algorithms
information-theory
models
heuristics
august 2010 by Vaguery
[0911.5460] Thresholding-based Iterative Selection Procedures for Generalized Linear Models
august 2010 by Vaguery
"High-dimensional correlated data pose challenges in model selection and predictive learning. In this paper, we derive an iterative thresholding technique for generalized linear models (GLMs) with possibly nonorthogonal designs. We propose a family of $\Theta$-estimators which are associated with penalized likelihoods and can be computed by thresholding-based iterative procedures. It can also be used to robustify GLMs and extend the canonical $M$-estimators.…"
variable-selection
statistics
models
modeling
august 2010 by Vaguery
[1007.5510] An algorithm for the principal component analysis of large data sets
august 2010 by Vaguery
"Recently popularized randomized methods for principal component analysis (PCA) efficiently and reliably produce nearly optimal accuracy - even on parallel processors - unlike the classical (deterministic) alternatives. We adapt one of these randomized methods for use with data sets that are too large to be stored in random-access memory (RAM). (The traditional terminology is that our procedure works efficiently "out-of-core.") We illustrate the performance of the algorithm via several numerical examples. For example, we report on the PCA of a data set stored on disk that is so large that less than a hundredth of it can fit in our computer's RAM."
algorithms
big-data-will-lead-to-big-inference
statistics
data-mining
exploratory-data-analysis
august 2010 by Vaguery
[1007.1075] Clustering Stability: An Overview
august 2010 by Vaguery
"A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications."
statistics
data-analysis
clustering
nonparametric-statistics
exploratory-data-analysis
heuristics
august 2010 by Vaguery
[1007.3254] Distinguishing Fact from Fiction: Pattern Recognition in Texts Using Complex Networks
august 2010 by Vaguery
"We establish concrete mathematical criteria to distinguish between different kinds of written storytelling, fictional and non-fictional. Specifically, we constructed a semantic network from both novels and news stories, with $N$ independent words as vertices or nodes, and edges or links allotted to words occurring within $m$ places of a given vertex; we call $m$ the word distance. We then used measures from complex network theory to distinguish between news and fiction, studying the minimal text length needed as well as the optimized word distance $m$. The literature samples were found to be most effectively represented by their corresponding power laws over degree distribution $P(k)$ and clustering coefficient $C(k)$; we also studied the mean geodesic distance, and found all our texts were small-world networks.…"
nudge-targets
computational-linguistics
linguistics
classification
machine-learning
statistics
natural-language-processing
august 2010 by Vaguery
[1006.5731] A Taxonomy of Networks
july 2010 by Vaguery
"The study of networks has grown into a substantial interdisciplinary endeavor across the natural, social, and information sciences. Yet there have been very few attempts to investigate the interrelatedness of the different classes of networks studied by different disciplines. Here, we introduced a framework to establish a taxonomy of networks from various origins. The provision of this family tree not only helps understand the kinship of networks, but also facilitates the transfer of empirical analysis, theoretical modeling, and conceptual developments across disciplinary boundaries. The framework is based on probing the mesoscopic properties of networks, an important source of heterogeneity for their structure and function. Using our method, we computed a taxonomy for 752 individual networks and a separate taxonomy for 12 network classes. We also computed three within-class taxonomies for political, fungal, and financial networks, and found them to be insightful in each case."
nudge-targets
classification
models
network-theory
statistics
complexology
ontology
taxonomy
july 2010 by Vaguery
[0906.5321] Efficient statistical inference for stochastic reaction processes
july 2010 by Vaguery
"We address the problem of estimating unknown model parameters and state variables in stochastic reaction processes when only sparse and noisy measurements are available. Using an asymptotic system size expansion for the backward equation we derive an efficient approximation for this problem. We demonstrate the validity of our approach on model systems and generalize our method to the case when some state variables are not observed."
models
statistics
inference
inverse-problems
nudge-targets
dynamical-systems
july 2010 by Vaguery
[1002.0377] Universal Laws and Economic Phenomena
july 2010 by Vaguery
Makes me want to write a simple agent-based model in which a few people have almost all the money and most everybody else is allowed to move a bit around, for a fee.
"This is a short commentary piece that discusses how the methods used in the natural sciences can apply to economics in general and financial markets specifically."
models
economics
statistics
physics-envy
"This is a short commentary piece that discusses how the methods used in the natural sciences can apply to economics in general and financial markets specifically."
july 2010 by Vaguery
[0903.5066] Modified-CS: Modifying Compressive Sensing for Problems with Partially Known Support
july 2010 by Vaguery
"We study the problem of reconstructing a sparse signal from a limited number of its linear projections when a part of its support is known, although the known part may contain some errors. The ``known" part of the support, denoted T, may be available from prior knowledge. Alternatively, in a problem of recursively reconstructing time sequences of sparse spatial signals, one may use the support estimate from the previous time instant as the ``known" part. The idea of our proposed solution (modified-CS) is to solve a convex relaxation of the following problem: find the signal that satisfies the data constraint and is sparsest outside of T.…"
compressed-sensing
algorithms
machine-learning
statistics
signal-processing
nudge-targets
data-analysis
july 2010 by Vaguery
[1007.4191] Fast Moment Estimation in Data Streams in Optimal Space
july 2010 by Vaguery
"We give a space-optimal algorithm with update time O(log^2(1/eps)loglog(1/eps)) for (1+eps)-approximating the pth frequency moment, 0 < p < 2, of a length-n vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time Omega(1/eps^2)."
nudge-targets
algorithms
data-analysis
online-learning
machine-learning
computational-complexity
statistics
july 2010 by Vaguery
Environment for DeveLoping KDD-Applications Supported by Index-Structures - Wikipedia, the free encyclopedia
july 2010 by Vaguery
"Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) is a Knowledge Discovery in Databases (KDD, "data mining") software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures."
clustering
algorithms
libraries
data-analysis
exploratory-data-analysis
statistics
nudge
july 2010 by Vaguery
[1004.3246] The Complexity of Finding Reset Words in Finite Automata
june 2010 by Vaguery
"We study several problems related to finding reset words in deterministic finite automata. In particular, we establish that the problem of deciding whether a shortest reset word has length k is complete for the complexity class DP. This result answers a question posed by Volkov. For the search problems of finding a shortest reset word and the length of a shortest reset word, we establish membership in the complexity classes FP^NP and FP^NP[log], respectively. Moreover, we show that both these problems are hard for FP^NP[log]. Finally, we observe that computing a reset word of a given length is FNP-complete."
finite-state-machine
statistics
computational-mechanics
modeling
optimization
computational-complexity
nudge-targets
june 2010 by Vaguery
[1006.4968] Validation of credit default probabilities via multiple testing procedures
june 2010 by Vaguery
"We apply multiple testing procedures to the validation of estimated default probabilities in credit rating systems. The goal is to identify rating classes for which the probability of default is estimated inaccurately, while still maintaining a predefined level of committing type I errors as measured by the familywise error rate (FWER) and the false discovery rate (FDR). For FWER, we also consider procedures that take possible discreteness of the data resp. test statistics into account. The performance of these methods is illustrated in a simulation setting and for empirical default data."
finance
prediction
data-mining
models
statistics
machine-learning
nudge-targets
june 2010 by Vaguery
[1006.5273] Linear Detrending Subsequence Matching in Time-Series Databases
june 2010 by Vaguery
"Each time-series has its own linear trend, the directionality of a timeseries, and removing the linear trend is crucial to get the more intuitive matching results. Supporting the linear detrending in subsequence matching is a challenging problem due to a huge number of possible subsequences. In this paper we define this problem the linear detrending subsequence matching and propose its efficient index-based solution. To this end, we first present a notion of LD-windows (LD means linear detrending), which is obtained as follows: we eliminate the linear trend from a subsequence rather than each window itself and obtain LD-windows by dividing the subsequence into windows. Using the LD-windows we then present a lower bounding theorem for the index-based matching solution and formally prove its correctness.…"
time-series
data-mining
data-analysis
prediction
statistics
nudge-targets
june 2010 by Vaguery
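The linear detrending step named in the abstract above can be illustrated with a minimal sketch. This is only the basic operation (fit a least-squares line to one subsequence and subtract it), not the paper's LD-window construction or its index-based matching; the function name `detrend` is hypothetical and the samples are assumed to be equally spaced.

```python
def detrend(values):
    """Subtract the least-squares line from equally spaced samples.

    Illustrative only: this is plain ordinary least squares on the
    positions 0..n-1, not the indexed method described in the paper.
    """
    n = len(values)
    if n < 2:
        # A single point has no trend; its residual is zero.
        return [0.0] * n
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals: what is left once the fitted line is removed.
    return [y - (slope * x + intercept) for x, y in zip(xs, values)]
```

Two subsequences that differ only by a straight-line drift give the same residuals, which is presumably why detrending makes the matches "more intuitive".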
[1006.3246] Sparse approaches for the exact distribution of patterns in long multi-states sequences generated by a Markov source
june 2010 by Vaguery
"We present two novel approaches for the computation of the exact distribution of a pattern in a long sequence. Both approaches take into account the sparse structure of the problem. The first approach relies on a partial recursion computing the largest eigenvalue of the the transition matrix of a Markov chain embedding. The second approach uses fast Taylor expansions of an exact bivariate rational reconstruction of the distribution. We illustrate the interest of both approaches on a simple toy-example and two biological applications: the transcription factors of the Human Chromosome 5 and the PROSITE signatures of functional motifs in proteins. On these examples our methods demonstrate their complementarity and their hability to extend the domain of feasibility for exact computations in pattern problems to a new level."
bioinformatics
nudge-targets
sequences
statistics
models
computational-mechanics
automata
june 2010 by Vaguery
[0911.4729] Hearing the clusters in a graph: A distributed algorithm
june 2010 by Vaguery
"We propose a novel distributed algorithm to cluster graphs. The algorithm recovers the solution obtained from spectral clustering without the need for expensive eigenvalue/vector computations. We prove that, by propagating waves through the graph, a local fast Fourier transform yields the local component of every eigenvector of the Laplacian matrix, which are used to cluster graphs. For large graphs, the proposed algorithm is orders of magnitude faster than random walk based approaches. We prove the equivalence of the proposed algorithm to spectral clustering and derive convergence rates. We also demonstrate the benefit of using this decentralized clustering algorithm to accelerate distributed estimation for sensor networks and for efficient computation of distributed multi-agent search strategies."
network-theory
graph-theory
clustering
algorithms
numerical-methods
statistics
nudge-targets
june 2010 by Vaguery
[1006.4330] Large gaps imputation in remote sensed imagery of the environment
june 2010 by Vaguery
"Imputation of missing data in large regions of satellite imagery is necessary when the acquired image has been damaged by shadows due to clouds, or information gaps produced by sensor failure.
The general approach for imputation of missing data, that could not be considered missed at random, suggests the use of other available data. Previous work, like local linear histogram matching, take advantage of a co-registered older image obtained by the same sensor, yielding good results in filling homogeneous regions, but poor results if the scenes being combined have radical differences in target radiance due, for example, to the presence of sun glint or snow.…"
nudge-targets
definitely-nudge-targets
imputation
statistics
machine-learning
data-analysis
june 2010 by Vaguery
[1006.4354] Empirical Modeling of Radiative versus Magnetic Flux for the Sun-as-a-Star
june 2010 by Vaguery
"…We find that a well-defined temporal component exists and accounts for some of the variance in the data. This temporal component arises because active regions with high magnetic field strength evolve, breaking up into small-scale magnetic elements with low field strength, and radiative and magnetic fluxes are sensitive to different active-region components. We generate empirical models that relate radiative flux to magnetic flux, allowing us to predict spectral-irradiance variations from observations of disk-averaged magnetic-flux density. In most cases, the model reconstructions can account for 85-90% of the variability of the radiative flux from the chromosphere and corona. Our results are important for understanding the relationship between magnetic and radiative measures of solar and stellar variability."
astronomy
astrophysics
modeling
learning-from-data
statistics
nudge-targets
june 2010 by Vaguery
[1006.3128] Fundamental Tradeoffs for Sparsity Pattern Recovery
june 2010 by Vaguery
"Recovery of the sparsity pattern (or support) of a sparse vector from a small number of noisy linear samples is a common problem that arises in signal processing and statistics. In the high dimensional setting, it is known that recovery with a vanishing fraction of errors is impossible if the sampling rate and per-sample signal-to-noise ratio (SNR) are finite constants independent of the length of the vector. In this paper, it is shown that recovery with an arbitrarily small but constant fraction of errors is, however, possible, and that in some cases a computationally simple thresholding estimator is near-optimal.…"
signal-processing
nudge-targets
information-theory
communication
numerical-methods
statistics
algorithms
approximation
heuristics
june 2010 by Vaguery
[0902.0600] Decisional States
june 2010 by Vaguery
"…The intrinsic underlying structure of the system is modeled by an epsilon-machine and its causal states. The decisional states are the emerging patterns corresponding to the utility function. In a complex systems perspective, these patterns thus form a partition of the lower-level system states that is defined according to the higher-level user's knowledge. The transitions between these decisional states correspond to events that lead to a change of decision. An algorithm is provided so as to estimate the states and their transitions from data. Application examples are given for hidden model reconstruction, cellular automata filtering, and edge detection in images."
computational-mechanics
information-theory
prediction
statistics
probability-theory
machine-learning
classification
june 2010 by Vaguery
[1006.1346] C-HiLasso: A Collaborative Hierarchical Sparse Modeling Framework
june 2010 by Vaguery
"Sparse modeling is a powerful framework for data analysis and processing. Traditionally, encoding in this framework is performed by solving an L1-regularized linear regression problem, commonly referred to as Lasso or Basis Pursuit. In this work we combine the sparsity-inducing property of the Lasso model at the individual feature level, with the block-sparsity property of the Group Lasso model, where sparse groups of features are jointly encoded, obtaining a sparsity pattern hierarchically structured. This results in the Hierarchical Lasso (HiLasso), which shows important practical modeling advantages.…"
numerical-methods
statistics
learning-from-data
machine-learning
image-processing
image-segmentation
nudge-targets
june 2010 by Vaguery
[1006.1328] Uncovering the Riffled Independence Structure of Rankings
june 2010 by Vaguery
"… In this paper, we provide a formal introduction to riffled independence and present algorithms for using riffled independence within Fourier-theoretic frameworks which have been explored by a number of recent papers. Additionally, we propose an automated method for discovering sets of items which are riffle independent from a training set of rankings. We show that our clustering-like algorithms can be used to discover meaningful latent coalitions from real preference ranking datasets and to learn the structure of hierarchically decomposable models based on riffled independence."
statistics
ranking
clustering
data-envelopment-analysis
multiobjective-optimization
nudge
numerical-methods
june 2010 by Vaguery
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
june 2010 by Vaguery
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and `generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hieirarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful.…Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
clustering
algorithms
statistics
models
classification
learning-from-data
june 2010 by Vaguery
[1006.3342] Local polynomial regression and variable selection
june 2010 by Vaguery
will I ever understand all the effort statisticians put into what I consider a solved problem? Pareto-GP is apparently utterly unknown, still
statistics
models-and-modes
modeling-is-not-mathematics
algorithms
regression
variable-selection
genetic-programming-target
june 2010 by Vaguery
[0907.5236] A Discussion on Mean Excess Plots
june 2010 by Vaguery
"A widely used tool in the study of risk, insurance and extreme values is the mean excess plot. One use is for validating a generalized Pareto model for the excess distribution. This paper investigates some theoretical and practical aspects of the use of the mean excess plot."
modeling
statistics
visualization
review
operations-research
extreme-values
june 2010 by Vaguery
[0812.3141] Choosing a penalty for model selection in heteroscedastic regression
june 2010 by Vaguery
"We consider the problem of choosing between several models in least-squares regression with heteroscedastic data. We prove that any penalization procedure is suboptimal when the penalty is a function of the dimension of the model, at least for some typical heteroscedastic model selection problems. In particular, Mallows' Cp is suboptimal in this framework. On the contrary, optimal model selection is possible with data-driven penalties such as resampling or $V$-fold penalties. Therefore, it is worth estimating the shape of the penalty from data, even at the price of a higher computational cost. Simulation experiments illustrate the existence of a trade-off between statistical accuracy and computational complexity. As a conclusion, we sketch some rules for choosing a penalty in least-squares regression, depending on what is known about possible variations of the noise-level."
statistics
statistical-tests
linear-regression
meta-optimization
nudge-targets
multiobjective-optimization
pragmatism-it-ain't
june 2010 by Vaguery
[1006.2307] Exploring the randomness of Directed Acyclic Networks
june 2010 by Vaguery
"The feed-forward relationship naturally observed in time-dependent processes and in a diverse number of real systems -such as some food-webs and electronic and neural wiring- can be described in terms of so-called directed acyclic graphs (DAGs). An important ingredient of the analysis of such networks is a proper comparison of their observed architecture against an ensemble of randomized graphs, thereby quantifying the {\em randomness} of the real systems with respect to suitable null models. This approximation is particularly relevant when the finite size and/or large connectivity of real systems make inadequate a comparison with the predictions obtained from the so-called {\em configuration model}. In this paper we analyze four methods of DAG randomization as defined by the desired combination of topological invariants (directed and undirected degree sequence and component distributions) aimed to be preserved.…"
networks
network-theory
graph-theory
algorithms
statistics
complexology
theoretical-biology
june 2010 by Vaguery
[1006.0849] Reconstruction of Causal Networks by Set Covering
june 2010 by Vaguery
"We present a method for the reconstruction of networks, based on the order of nodes visited by a stochastic branching process. Our algorithm reconstructs a network of minimal size that ensures consistency with the data. Crucially, we show that global consistency with the data can be achieved through purely local considerations, inferring the neighbourhood of each node in turn. The optimisation problem solved for each individual node can be reduced to a Set Covering Problem, which is known to be NP-hard but can be approximated well in practice. We then extend our approach to account for noisy data, based on the Minimum Description Length principle. We demonstrate our algorithms on synthetic data, generated by an SIR-like epidemiological model."
network-theory
modeling
statistics
learning-from-data
learning-by-doing
algorithms
nudge-targets
june 2010 by Vaguery
[1006.0764] General Purpose Convolution Algorithm in S4-Classes by means of FFT
june 2010 by Vaguery
"Object orientation provides a flexible framework for the implementation of the convolution of arbitrary distributions of real-valued random variables.
We discuss an algorithm which is based on the Discrete Fourier Transformation and its fast computability via the Fast Fourier Transformation. It directly applies to lattice-supported distributions. In the case of continuous distributions an additional discretization to a linear lattice is necessary and the resulting lattice-supported distributions are suitably smoothed after convolution."
statistics
R
library
probability-theory
libraries
open-source
nudge
june 2010 by Vaguery
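The lattice-supported case described in the abstract above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the R/S4 implementation the paper describes: both distributions are given as probability lists on the same unit-spaced lattice, and a naive O(n^2) discrete Fourier transform from the standard library stands in for a real FFT. The names `dft` and `convolve_pmfs` are hypothetical.

```python
import cmath


def dft(xs, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(xs)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t, x in enumerate(xs)
        )
        # The inverse transform carries the 1/n normalisation.
        out.append(total / n if inverse else total)
    return out


def convolve_pmfs(p, q):
    """Distribution of X + Y for independent lattice-valued X ~ p, Y ~ q."""
    n = len(p) + len(q) - 1  # zero-pad so circular convolution equals linear
    fp = dft(list(p) + [0.0] * (n - len(p)))
    fq = dft(list(q) + [0.0] * (n - len(q)))
    product = [a * b for a, b in zip(fp, fq)]
    # Discard the negligible imaginary parts and clip rounding noise below zero.
    return [max(v.real, 0.0) for v in dft(product, inverse=True)]
```

For example, `convolve_pmfs([1/6] * 6, [1/6] * 6)` gives the eleven probabilities for the sum of two fair dice. The continuous case mentioned in the abstract would need the extra discretization and smoothing steps, which are not attempted here.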
What is data science? - O'Reilly Radar
june 2010 by Vaguery
"We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?
In this post, I examine the many sides of data science -- the technologies, the companies and the unique skill sets."
data-analysis
data-mining
learning-from-data
statistics
futurism
drinking-from-the-firehose
nudge
via:tsuomela
june 2010 by Vaguery
[0908.2503] Sequential Quantile Prediction of Time Series
june 2010 by Vaguery
"Motivated by a broad range of potential applications, we address the quantile prediction problem of real-valued time series. We present a sequential quantile forecasting model based on the combination of a set of elementary nearest neighbor-type predictors called "experts" and show its consistency under a minimum of conditions. Our approach builds on the methodology developed in recent years for prediction of individual sequences and exploits the quantile structure as a minimizer of the so-called pinball loss function. We perform an in-depth analysis of real-world data sets and show that this nonparametric strategy generally outperforms standard quantile prediction methods"
time-series
prediction
models
statistics
nudge-targets
learning-from-data
machine-learning
june 2010 by Vaguery
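The "quantile as minimizer of the pinball loss" idea in the abstract above can be checked with a small sketch. It is not the paper's sequential expert-combination forecaster; it only evaluates the loss over a fixed sample and picks the sample value that minimizes it. All function names are hypothetical.

```python
def pinball_loss(y, q, tau):
    """Pinball (quantile) loss of predicting q when y is observed."""
    diff = y - q
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1) * diff


def mean_pinball(data, q, tau):
    """Average pinball loss of the constant prediction q over data."""
    return sum(pinball_loss(y, q, tau) for y in data) / len(data)


def quantile_by_loss(data, tau):
    """Sample value with the smallest average pinball loss.

    A minimizer always exists among the observed values, so searching
    the sample itself is enough for this illustration.
    """
    return min(sorted(data), key=lambda q: mean_pinball(data, q, tau))
```

With `tau = 0.5` the loss is half the absolute error, so the minimizer is a sample median; other values of `tau` tilt the penalty and move the minimizer toward the corresponding empirical quantile.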
[1005.4358] On the estimation of the extremal index based on scaling and resampling
may 2010 by Vaguery
"The extremal index parameter theta characterizes the degree of local dependence in the extremes of a stationary time series and has important applications in a number of areas, such as hydrology, telecommunications, finance and environmental studies.…Further, a procedure for the automatic selection of its tuning parameter is developed and different types of confidence intervals that prove useful in practice proposed. The performance of the estimator is examined through simulations, which show its highly competitive behavior. Finally, the estimator is applied to three real data sets of daily crude oil prices, daily returns of the S&P 500 stock index, and high-frequency, intra-day traded volumes of a stock. These applications demonstrate additional diagnostic features of statistical plots based on the new estimator."
statistics
time-series
statistical-tests
nudge-targets
algorithms
extreme-values
may 2010 by Vaguery
[1005.4274] This is SPIRAL-TAP: Sparse Poisson Intensity Reconstruction ALgorithms - Theory and Practice
may 2010 by Vaguery
"The optimization formulation considered in this paper uses a penalized negative Poisson log-likelihood objective function with nonnegativity constraints (since Poisson intensities are naturally nonnegative). In particular, the proposed approach incorporates key ideas of using separable quadratic approximations to the objective function at each iteration and penalization terms related to l1 norms of coefficient vectors, total variation seminorms, and partition-based multiscale estimation methods."
optimization
models
statistics
algorithms
image-processing
image-analysis
umlauts
may 2010 by Vaguery
[1005.3680] Quantifying long-range correlations in complex networks beyond nearest neighbors
may 2010 by Vaguery
"We propose a fluctuation analysis to quantify spatial correlations in complex networks. The approach considers the sequences of degrees along shortest paths in the networks and quantifies the fluctuations in analogy to time series. In this work, the Barabasi-Albert (BA) model, the Cayley tree at the percolation transition, a fractal network model, and examples of real-world networks are studied. While the fluctuation functions for the BA model show exponential decay, in the case of the Cayley tree and the fractal network model the fluctuation functions display a power-law behavior. The fractal network model comprises long-range anti-correlations. The results suggest that the fluctuation exponent provides complementary information to the fractal dimension."
complexology
network-theory
physics
statistics
may 2010 by Vaguery
[1005.3579] Graph-Structured Multi-task Regression and an Efficient Optimization Method for General Fused Lasso
may 2010 by Vaguery
strangely, I have almost no idea what this is about; "multi-task regression" got me, though
machine-learning
statistics
I-guess
may 2010 by Vaguery
Random matrices in the news : Applied Statistics
may 2010 by Vaguery
"Now, to return to the news article. If the eigenvalue distribution is an attractor, this means that a lot of physical and social phenomena which can be modeled by eigenvalues (including, apparently, quantum energy levels and some properties of statistical tests) might have a common structure. Just as, at a similar level, we see the normal distribution and related functions in all sorts of unusual places."
random-matrix
statistics
complexology
physics
applied-mathematics
universality
may 2010 by Vaguery
[1005.2715] On the Subspace of Image Gradient Orientations
may 2010 by Vaguery
"We introduce the notion of Principal Component Analysis (PCA) of image gradient orientations. As image data is typically noisy, but noise is substantially different from Gaussian, traditional PCA of pixel intensities very often fails to estimate reliably the low-dimensional subspace of a given data population. We show that replacing intensities with gradient orientations and the $\ell_2$ norm with a cosine-based distance measure offers, to some extent, a remedy to this problem.…"
image-processing
signal-processing
image-analysis
machine-learning
statistics
PCA
nudge-targets
may 2010 by Vaguery
[1005.2979] Robust and Adaptive Algorithms for Online Portfolio Selection
may 2010 by Vaguery
"… Our methods use simple ideas from signal processing and statistics, which are sometimes overlooked in the empirical financial literature. The two approaches are evaluated against benchmark allocation techniques using 4 real datasets. Our methods outperform the benchmark allocation techniques in these datasets, in terms of both computational demand and financial performance."
trading
financial-engineering
stocks
machine-learning
statistics
algorithms
portfolio-theory
may 2010 by Vaguery
[0906.4779] Minimum Probability Flow Learning
may 2010 by Vaguery
"Learning in probabilistic models is often hampered by the general intractability of the normalization factor and its derivatives. Here we propose a new learning technique that obviates the need to compute an intractable normalization factor or sample from the equilibrium distribution of the model. This is achieved by establishing dynamics that would transform the observed data distribution into the model distribution, and then setting as the objective the minimization of the initial flow of probability away from the data distribution.…"
learning-from-data
statistics
machine-learning
estimation
algorithms
to-understand
may 2010 by Vaguery
[0905.0917] Determining interaction rules in animal swarms
may 2010 by Vaguery
"In this paper we introduce a method for determining local interaction rules in animal swarms. The method is based on the assumption that the behavior of individuals in a swarm can be treated as a set of mechanistic rules.
The principal idea behind the technique is to vary parameters that define a set of hypothetical interactions to minimize the deviation between the forces estimated from observed animal trajectories and the forces resulting from the assumed rule set. We demonstrate the method by reconstructing the interaction rules from the trajectories produced by a computer simulation."
inverse-problems
agent-based
boids
nudge-targets
statistics
model-discovery
may 2010 by Vaguery
[0912.1567] Quantifying the Ease of Scientific Discovery
may 2010 by Vaguery
"It has long been known that scientific output proceeds on an exponential increase, or more properly, a logistic growth curve. The interplay between effort and discovery is clear, and the nature of the functional form has been thought to be due to many changes in the scientific process over time. Here I show a quantitative method for examining the ease of scientific progress, another necessary component in understanding scientific discovery. Using examples from three different scientific disciplines - mammalian species, chemical elements, and minor planets - I find the ease of discovery to conform to an exponential decay. In addition, I show how the pace of scientific discovery can be best understood as the outcome of both scientific output and ease of discovery."
science
arrival-times
statistics
innovation
empirical-economics
applicable-to-genetic-programming
metering
may 2010 by Vaguery
[1005.0182] A Multi Agent Model for the Limit Order Book Dynamics
may 2010 by Vaguery
"In the present work we introduce a novel multi-agent model with the aim to reproduce the dynamics of a double auction market at microscopic time scale through a faithful simulation of the matching mechanics in the limit order book. The model follows a "zero intelligence" approach where the actions of the traders are related to a stochastic variable, the market sentiment, which we define as a mixture of public and private information. The model, despite the parsimonious approach, is able to reproduce several empirical features of the high-frequency dynamics of the market microstructure not only related to the price movements but also to the deposition of the orders in the book."
modeling
agent-based
finance
markets
simulation
algorithms
statistics
may 2010 by Vaguery
[1005.2197] Scalable Tensor Factorizations for Incomplete Data
may 2010 by Vaguery
"Our numerical studies suggest that the proposed CP-WOPT approach is accurate and scalable. CP-WOPT can recover the underlying factors successfully with large amounts of missing data, e.g., 90% missing entries for tensors of size 50 × 40 × 30. We have also studied how CP-WOPT can scale to problems of larger sizes, e.g., 1000 × 1000 × 1000, and recover CP factors from large, sparse tensors with 99.5% missing data.…"
statistics
numerical-methods
missing-data
scientific-computing
algorithms
may 2010 by Vaguery
[1005.2314] Some comments on C. S. Wallace's random number generators
may 2010 by Vaguery
"Although care needs to be taken in the implementation of normal random number generators like fastnorm, and the end-user should be aware of the small but unavoidable defects discussed in §§5.6-5.7, these generators have such a performance advantage over more conventional generators that they can not be ignored in applications where the speed of generation of pseudo-random numbers is critical."
nudge-targets
pseudorandom-numbers
algorithms
statistics
computer-science
numerical-methods
may 2010 by Vaguery
[1005.0660] The Significant Digit Law in Statistical Physics
may 2010 by Vaguery
"The occurrence of the nonzero leftmost digit, i.e., 1, 2, ..., 9, of numbers from many real world sources is not uniformly distributed as one might naively expect, but instead, the nature favors smaller ones according to a logarithmic distribution, named Benford's law. We investigate three kinds of widely used physical statistics, i.e., the Boltzmann-Gibbs (BG) distribution, the Fermi-Dirac (FD) distribution, and the Bose-Einstein (BE) distribution, and find that the BG and FD distributions both fluctuate slightly in a periodic manner around the Benford distribution with respect to the temperature of the system, while the BE distribution conforms to it exactly whatever the temperature is. Thus the Benford's law seems to present a general pattern for physical statistics and might be even more fundamental and profound in nature. Furthermore, various elegant properties of Benford's law, especially the mantissa distribution of data sets, are discussed."
Benford's-law
mysteries-of-the-universe
number-theory
statistics
WTF
may 2010 by Vaguery
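A note on the logarithmic distribution mentioned in the Benford's-law abstract above: the standard statement of the law gives the probability of leading digit d as log10(1 + 1/d). The sketch below is only an illustration of that formula, not anything from the paper; the function names and the powers-of-two sample are assumptions chosen for the example.

```python
import math
from collections import Counter


def benford_expected():
    """Benford probabilities P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First nonzero digit of x, read off its scientific-notation form.

    Values extremely close to a power of ten may round up when formatted.
    """
    return int(f"{abs(x):e}"[0])


def leading_digit_frequencies(values):
    """Observed relative frequency of each leading digit (zeros skipped)."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one nonzero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


if __name__ == "__main__":
    # Hypothetical sample: powers of two, a sequence often used to illustrate the law.
    sample = [2 ** k for k in range(1, 201)]
    expected = benford_expected()
    observed = leading_digit_frequencies(sample)
    for d in range(1, 10):
        print(d, round(expected[d], 3), round(observed[d], 3))
```

Comparing the two columns gives a rough visual check only; a proper goodness-of-fit test would be needed before drawing conclusions from real data.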
R Programming - Wikibooks, collection of open-content textbooks
may 2010 by Vaguery
"This is a guide to the R programming language."
R
R-language
documentation
learning
open-source
statistics
programming
may 2010 by Vaguery
[1005.1327] Statistical Model Checking : An Overview
may 2010 by Vaguery
"Quantitative properties of stochastic systems are usually specified in logics that allow one to compare the measure of executions satisfying certain temporal properties with thresholds. The model checking problem for stochastic systems with respect to such logics is typically solved by a numerical approach that iteratively computes (or approximates) the exact measure of paths satisfying relevant subformulas; the algorithms themselves depend on the class of systems being analyzed as well as the logic used for specifying the properties. Another approach to solve the model checking problem is to simulate the system for finitely many runs, and use hypothesis testing to infer whether the samples provide a statistical evidence for the satisfaction or violation of the specification. In this short paper, we survey the statistical approach, and outline its main advantages in terms of efficiency, uniformity, and simplicity."
complexology
simulation
statistics
models
modeling-is-not-mathematics
inference
explanatory-power
may 2010 by Vaguery