Vaguery + statistics   175

Attractive Models - Kieran Healy
"Now, if you write a paper describing negative results—a model where nothing is significant—then you may have a hard time getting it published. In the absence of some specific controversy, negative results are boring. For the same reason, though, if your results just barely cross the threshold of conventional significance, they may stand a disproportionately better chance of getting published than an otherwise quite similar paper where the results just failed to make the threshold. And this is what the graph above shows, for papers published in the American Political Science Review. It’s a histogram of p-values for coefficients in regressions reported in the journal. The dashed line is the conventional threshold for significance. The tall red bar to the right of the dashed line is the number of coefficients that just made it over the threshold, while the short red bar is the number of coefficients that just failed to do so. If there were no bias in the publication process, the shape of the histogram would approximate the right-hand side of a bell curve. The gap between the big and the small red bars is a consequence of two things: the unwillingness of journals to report negative results, and the efforts of authors to search for (and write up) results that cross the conventional threshold."
statistics  academic-culture  publishing  meta-analysis 
27 days ago by Vaguery
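The comparison of the two red bars in the excerpt above can be sketched numerically. This is a minimal illustration, not the analysis behind the graph Healy discusses: the window width, the made-up p-values, and the function names `caliper_counts` and `binomial_upper_tail` are all assumptions. It also assumes that, without selection, a coefficient falling in a narrow window around the threshold is about equally likely to land on either side.

```python
from math import comb

def caliper_counts(p_values, threshold=0.05, width=0.01):
    """Count p-values just below and just above the significance threshold."""
    just_significant = sum(1 for p in p_values if threshold - width <= p < threshold)
    just_missed = sum(1 for p in p_values if threshold <= p < threshold + width)
    return just_significant, just_missed

def binomial_upper_tail(k, n, prob=0.5):
    """P(X >= k) for X ~ Binomial(n, prob), computed exactly."""
    return sum(comb(n, i) * prob**i * (1 - prob) ** (n - i) for i in range(k, n + 1))

# Hypothetical p-values, invented for illustration only.
p_values = [0.041, 0.043, 0.044, 0.046, 0.047, 0.048, 0.049, 0.049,
            0.052, 0.057, 0.20, 0.003]

just_significant, just_missed = caliper_counts(p_values)
# Chance of a split at least this lopsided if both sides were equally likely.
tail = binomial_upper_tail(just_significant, just_significant + just_missed)
print(just_significant, just_missed, tail)
```

A small tail probability only says the split is lopsided; it cannot separate journal selection from authors searching for significant specifications, which the excerpt names as two distinct causes.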
No, physicians don’t understand screening statistics | The Incidental Economist
"So basically,when it comes to saving lives, docs are three times more likely to recommend a screening test based on irrelevant data than they are to recommend it based on relevant data. I’m bracing myself for the hate mail, but this is part of the reason why I’m skeptical that just providing docs with more evidence will change the way they practice. Most docs just aren’t trained to understand this stuff."
medical-culture  healthcare  statistics  probability-theory  planning 
4 weeks ago by Vaguery
An algorithm is just an algorithm | Gene Expression | Discover Magazine
"Another illustration that knowledge comes not through blind adherence to methods, but human reflection."
algorithms  statistics  storytelling  i-need-the-name-for-this 
4 weeks ago by Vaguery
[1203.3353] Solving Structure with Sparse, Randomly-Oriented X-ray Data
"Single-particle imaging experiments of biomolecules at x-ray free-electron lasers (XFELs) require processing of hundreds of thousands (or more) of images that contain very few x-rays. Each low-flux image of the diffraction pattern is produced by a single, randomly oriented particle, such as a protein. We demonstrate the feasibility of collecting data at these extremes, averaging only 2.5 photons per frame, where it seems doubtful there could be information about the state of rotation, let alone the image contrast. This is accomplished with an expectation maximization algorithm that processes the low-flux data in aggregate, and without any prior knowledge of the object or its orientation. The versatility of the method promises, more generally, to redefine what measurement scenarios can provide useful signal in the high-noise regime."
structural-biology  image-analysis  crystallography  algorithms  inverse-problems  nudge-targets  statistics 
9 weeks ago by Vaguery
[1203.3284] Efficient Enumeration of the Directed Binary Perfect Phylogenies from Incomplete Data
"We study a character-based phylogeny reconstruction problem when an incomplete set of data is given. More specifically, we consider the situation under the directed perfect phylogeny assumption with binary characters in which for some species the states of some characters are missing. Our main object is to give an efficient algorithm to enumerate (or list) all perfect phylogenies that can be obtained when the missing entries are completed. While a simple branch-and-bound algorithm (B&B) shows a theoretically good performance, we propose another approach based on a zero-suppressed binary decision diagram (ZDD). Experimental results on randomly generated data exhibit that the ZDD approach outperforms B&B. We also prove that counting the number of phylogenetic trees consistent with a given data is #P-complete, thus providing an evidence that an efficient random sampling seems hard."
phylogenetics  inverse-problems  genetics  algorithms  statistics  nudge-targets 
9 weeks ago by Vaguery
[1203.1975] Warped Functional Regression
"A characteristic feature of functional data is the presence of time variability in addition to amplitude variability. The existing functional regression methods do not handle time variability in an explicit and efficient way. In this paper we introduce a functional regression method that incorporates time warping as an intrinsic part of the model. The method achieves good predictive power in a parsimonious way, and allows for unified statistical inference of time and amplitude variability. The properties of the estimators are studied by simulation, and an application to the modeling of ground-level ozone trajectories is presented."
statistics  time-series  modeling  algorithms 
10 weeks ago by Vaguery
[1203.1065] Subspace clustering of high-dimensional data: a predictive approach
"In several application domains, high-dimensional observations are collected and then analysed in search for naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC) partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying our simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets for which PSC often provides state-of-art results."
ain't-performance-space  statistics  clustering  cure-for-dimensionality  algorithms 
10 weeks ago by Vaguery
[1111.3304] Eigenvector Synchronization, Graph Rigidity and the Molecule Problem
"The graph realization problem has received a great deal of attention in recent years, due to its importance in applications such as wireless sensor networks and structural biology.…"
algorithms  statistics  structure  learning-from-data  nudge-targets 
11 weeks ago by Vaguery
[1003.5956] Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms
"…In this paper, we introduce a replay method- ology for contextual bandit algorithm evaluation. Different from simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbi- ased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show ac- curacy and effectiveness of our offline evaluation method."
classification  recommendations  algorithms  machine-learning  crowdsourcing  nudge-targets  statistics 
11 weeks ago by Vaguery
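A rough sketch of how a replay evaluation like the one in the [1003.5956] excerpt can work, under the assumption that the logged actions were chosen uniformly at random (the condition usually cited for the unbiasedness claim). The function names, the toy contexts, and the reward rule are invented for illustration and are not taken from the paper.

```python
import random

def replay_evaluate(policy, logged_events):
    """Estimate a policy's average reward by replaying randomly logged events.

    Only events where the policy picks the same action that was logged are
    kept; everything else is skipped because its reward is unknown.
    """
    history = []
    total_reward = 0.0
    for context, logged_action, reward in logged_events:
        if policy(history, context) == logged_action:
            history.append((context, logged_action, reward))
            total_reward += reward
    matched = len(history)
    return (total_reward / matched if matched else 0.0), matched

def topic_policy(history, context):
    """Hypothetical policy: show article A to sports readers, B otherwise."""
    return "A" if context == "sports" else "B"

# Toy log: uniformly random actions, reward 1.0 when the article fits the reader.
rng = random.Random(0)
logged_events = []
for _ in range(1000):
    context = rng.choice(["sports", "politics"])
    action = rng.choice(["A", "B"])
    reward = 1.0 if (context == "sports") == (action == "A") else 0.0
    logged_events.append((context, action, reward))

estimate, matched = replay_evaluate(topic_policy, logged_events)
print(estimate, matched)
```

Roughly half of the log is discarded for a two-action problem, which is the price of needing no simulator; the `history` argument is there so that a learning policy could update itself only on the events it would actually have seen.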
Visualization series: Insight from Cleveland and Tufte on plotting numeric data by groups | Solomon Messing
"A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below).  Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points.  Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive."
visualization  data-analysis  communication  graphic-design  argumentation  statistics  ggplot2 
11 weeks ago by Vaguery
[1112.6235] Detecting a Vector Based on Linear Measurements
We consider a situation where the state of a system is represented by a real-valued vector. Under normal circumstances, the vector is zero, while an event manifests as non-zero entries in this vector, possibly few. Our interest is in the design of algorithms that can reliably detect events (i.e., test whether the vector is zero or not) with the least amount of information. We place ourselves in a situation, now common in the signal processing literature, where information about the vector comes in the form of noisy linear measurements. We derive information bounds in an active learning setup and exhibit some simple near-optimal algorithms. In particular, our results show that the task of detection within this setting is at once much easier, simpler and different than the tasks of estimation and support recovery.
signal-processing  statistics  algorithms  nudge-targets 
january 2012 by Vaguery
[1109.2215] Finding missing edges and communities in incomplete networks
Many algorithms have been proposed for predicting missing edges in networks, but they do not usually take account of which edges are missing. We focus on networks which have missing edges of the form that is likely to occur in real networks, and compare algorithms that find these missing edges. We also investigate the effect of this kind of missing data on community detection algorithms.
network-theory  algorithms  inference  statistics  nudge-targets 
january 2012 by Vaguery
[1010.4735] Exploring the Energy Landscapes of Protein Folding Simulations with Bayesian Computation
Nested sampling is a Bayesian sampling technique developed to explore probability distributions localised in an exponentially small area of the parameter space. The algorithm provides both posterior samples and an estimate of the evidence (marginal likelihood) of the model. The nested sampling algorithm also provides an efficient way to calculate free energies and the expectation value of thermodynamic observables at any temperature, through a simple post-processing of the output. Previous applications of the algorithm have yielded large efficiency gains over other sampling techniques, including parallel tempering (replica exchange). In this paper we describe a parallel implementation of the nested sampling algorithm and its application to the problem of protein folding in a Go-type force field of empirical potentials that were designed to stabilize secondary structure elements in room-temperature simulations. We demonstrate the method by conducting folding simulations on a number of small proteins which are commonly used for testing protein folding procedures: protein G, the SH3 domain of Src tyrosine kinase and chymotrypsin inhibitor 2. A topological analysis of the posterior samples is performed to produce energy landscape charts, which give a high level description of the potential energy surface for the protein folding simulations. These charts provide qualitative insights into both the folding process and the nature of the model and force field used.
structural-biology  biochemistry  modeling  algorithms  statistics  metamodeling 
january 2012 by Vaguery
[1109.3248] Reconstruction of sequential data with density models
We introduce the problem of reconstructing a sequence of multidimensional real vectors where some of the data are missing. This problem contains regression and mapping inversion as particular cases where the pattern of missing data is independent of the sequence index. The problem is hard because it involves possibly multivalued mappings at each vector in the sequence, where the missing variables can take more than one value given the present variables; and the set of missing variables can vary from one vector to the next. To solve this problem, we propose an algorithm based on two redundancy assumptions: vector redundancy (the data live in a low-dimensional manifold), so that the present variables constrain the missing ones; and sequence redundancy (e.g. continuity), so that consecutive vectors constrain each other. We capture the low-dimensional nature of the data in a probabilistic way with a joint density model, here the generative topographic mapping, which results in a Gaussian mixture. Candidate reconstructions at each vector are obtained as all the modes of the conditional distribution of missing variables given present variables. The reconstructed sequence is obtained by minimising a global constraint, here the sequence length, by dynamic programming. We present experimental results for a toy problem and for inverse kinematics of a robot arm.
inverse-problems  statistics  algorithms  learning-from-data  nudge-targets 
january 2012 by Vaguery
[1112.6178] A general framework for online audio source separation
We consider the problem of online audio source separation. Existing algorithms adopt either a sliding block approach or a stochastic gradient approach, which is faster but less accurate. Also, they rely either on spatial cues or on spectral cues and cannot separate certain mixtures. In this paper, we design a general online audio source separation framework that combines both approaches and both types of cues. The model parameters are estimated in the Maximum Likelihood (ML) sense using a Generalised Expectation Maximisation (GEM) algorithm with multiplicative updates. The separation performance is evaluated as a function of the block size and the step size and compared to that of an offline algorithm.
signal-processing  audio-segmentation  statistics  algorithms  metaheuristics  nudge-targets 
january 2012 by Vaguery
[1112.0826] Clustering under Perturbation Resilience
Recently, Bilu and Linial formalized an implicit assumption often made when choosing a clustering objective: that the optimum clustering to the objective should be preserved under small multiplicative perturbations to distances between points. They showed that for max-cut clustering it is possible to circumvent NP-hardness and obtain polynomial-time algorithms for instances resilient to large (factor $O(\sqrt{n})$) perturbations, and subsequently Awasthi et al. considered center-based objectives, giving algorithms for instances resilient to $O(1)$ factor perturbations.
In this paper, we greatly advance this line of work. For center-based objectives, we present an algorithm that can optimally cluster instances resilient to $(1 + \sqrt{2})$-factor perturbations, solving an open problem of Awasthi et al. For a commonly used center-based objective $k$-median, we additionally give algorithms for a more relaxed assumption in which we allow the optimal solution to change in a small $\epsilon$ fraction of the points after perturbation. We give the first bounds known for this more realistic and more general setting. We also provide positive results for min-sum clustering which is a generally much harder objective than $k$-median (and also non-center-based). Our algorithms are based on new linkage criteria that may be of independent interest.
Additionally, we give sublinear-time algorithms, showing algorithms that can return an implicit clustering from only access to a small random sample.
clustering  statistics  nonparametric-methods  robustness  resilience  algorithms  nudge-targets 
january 2012 by Vaguery
[1112.5794] BATMAN-an R package for the automated quantification of metabolites from NMR spectra using a Bayesian Model
Motivation: NMR spectra are widely used in metabolomics to obtain metabolite profiles in complex biological mixtures. Common methods used to assign and estimate concentrations of metabolite involve either an expert manual peak fitting or extra pre-processing steps, such as peak alignment and binning. Peak fitting is very time consuming and is subject to human error. Conversely, alignment and binning can introduce artifacts and limit immediate biological interpretation of models. Results: We present the Bayesian AuTomated Metabolite Analyser for NMR spectra (BATMAN), an R package which deconvolves peaks from 1-dimensional NMR spectra, automatically assigns them to specific metabolites and obtains concentration estimates. The Bayesian model incorporates information on characteristic peak patterns of metabolites and is able to account for shifts in the position of peaks commonly seen in NMR spectra of biological samples. It applies a Markov Chain Monte Carlo (MCMC) algorithm to sample from a joint posterior distribution of the model parameters and obtains concentration estimates with reduced mean estimation error compared with conventional numerical integration methods.
learning-from-data  statistics  modeling  biochemistry  nudge-targets  image-segmentation 
january 2012 by Vaguery
[1109.5664] Deterministic Feature Selection for $k$-means Clustering
"We study feature selection for $k$-means clustering. Although the literature contains many methods with good empirical performance, algorithms with provable theoretical behavior have only recently been developed. Unfortunately, these algorithms are randomized and fail with, say, a constant probability. We address this issue by presenting a emph{deterministic} feature selection algorithm for $k$-means with theoretical guarantees. At the heart of our algorithm lies a deterministic method for decompositions of the identity."
clustering  statistics  algorithms  nudge-targets 
december 2011 by Vaguery
[1107.2379] Data Stability in Clustering: A Closer Look
"This paper considers the model introduced by Bilu and Linial (2010), who study problems for which the optimal clustering does not change when the distances are perturbed by multiplicative factors. They show that even when a problem is NP-hard, it is sometimes possible to obtain polynomial-time algorithms for instances resilient to large perturbations, e.g. on the order of $O(sqrt{n})$ for max-cut clustering. Awasthi et al. (2010) extend this line of work by considering center-based objectives, and Balcan and Liang (2011) consider the $k$-median and min-sum objectives, giving efficient algorithms for instances resilient to certain constant multiplicative perturbations.

Here, we are motivated by the question of to what extent these assumptions can be relaxed while allowing for efficient algorithms. We show there is little room to improve these results by giving NP-hardness lower bounds for both the $k$-median and min-sum objectives. On the other hand, we show that multiplicative resilience parameters, even only on the order of $\Theta(1)$, can be so strong as to make the clustering problem trivial, and we exploit these assumptions to present a simple one pass streaming algorithm for the $k$-median objective. We also consider a model of additive perturbations and give a correspondence between additive and multiplicative notions of stability. Our results provide a close examination of the consequences of assuming, even constant, stability in data."
clustering  statistics  algorithms  robustness  nudge-targets 
december 2011 by Vaguery
[1110.0463] A binary noisy channel to model errors in printing process
To model printing noise a binary noisy channel and a set of controlled gates are introduced. The channel input is an image created by a halftoning algorithm and its output is the printed picture. Using this channel robustness to noise between halftoning algorithms can be studied. We introduced relative entropy to describe immunity of the algorithm to noise and tested several halftoning algorithms.
printing  modeling  inverse-problems  simulation  statistics  nudge-targets 
november 2011 by Vaguery
[1110.1462] Dynamic Clustering of Histogram Data Based on Adaptive Squared Wasserstein Distances
"…To cluster sets of histogram data, we propose to use Dynamic Clustering Algorithm, (based on adaptive squared Wasserstein distances) that is a k-means-like algorithm for clustering a set of individuals into K classes that are apriori fixed. The main aim of this research is to provide a tool for clustering histograms, emphasizing the different contributions of the histogram variables, and their components, to the definition of the clusters. We demonstrate that this can be achieved using adaptive distances.

Two kinds of adaptive distances are considered: the first takes into account the variability of each component of each descriptor for the whole set of individuals; the second takes into account the variability of each component of each descriptor in each cluster. We furnish interpretative tools of the obtained partition based on an extension of the classical measures (indexes) to the use of adaptive distances in the clustering criterion function. Applications on synthetic and real-world data corroborate the proposed procedure."
classification  statistics  histograms  metrics  clustering 
october 2011 by Vaguery
[1110.0725] A Survey of Distributed Data Aggregation Algorithms
"Distributed data aggregation has been an active field of research in the last decade, and a huge diverse amount of techniques can be found in the literature. For this reasons, this survey intends to be an important time saving instrument, for those that desire to get a quick and comprehensive overview of the state of the art on distributed data aggregation. Moreover, by carefully highlighting the strength and limitations of the more pertinent approaches, this study can provide a useful assistance to help readers choose which technique to apply in specific settings.

Currently, there is no ideal general solution to the distributed computation of an aggregation function; all existing techniques have their pitfalls (some more than others). Therefore, more research in this field will be expected in the next few years. In particular, due to the added value of computing complex aggregates, new algorithms might arise to estimate the statistical distribution of values, as the few existing approaches exhibit some limitations in terms of accuracy and resource consumption. Additional research efforts should be made to improve the support to churn, message loss, and continuous estimation of mutable input values."
statistics  reviews  distributed-processing  communication  coordination  nudge-targets 
october 2011 by Vaguery
Even Tiny Bouts of Exercise are Associated with Increased Fitness | Obesity Panacea
"These results are encouraging and suggest that random, short duration physical activity, which may be more feasible and enjoyable for inactive individuals attempting to engage in physical activity for health benefit, is indeed beneficial."
exercise  healthcare  fitness  statistics 
june 2011 by Vaguery
Weighty Matters: Is sodium a dietary red herring for the effects of processed foods?
"I think there's at least one more possibility:

3. Sodium isn't a causal agent of disease but instead, given that processed foods are phenomenally high in sodium, is a useful biomarker for the degree of processed foods a person's consuming, and that it's the huge volumes of sugar and pulverized flour (that's more often than not packaged with gobs of sodium) that's actually causal for cardiovascular disease and death."
healthcare  statistics  medical-culture  consumerism  fast-food 
june 2011 by Vaguery
Doctors are human | The Incidental Economist
"…But this is America. If you want to have the procedure, so be it. You get to choose. That’s the way we roll.

My question is, did your doctor recommend it? Did your doctor tell you about this study? Do you think that those who recommend and perform this procedure don’t know about this study, and that if only they had this evidence they’d stop?

Or, do you think physicians are influenced by biases and their personal beliefs? Me? I think they’re human."
medical-culture  statistics  healthcare  marketing  cognitive-psychology  evidence-based 
june 2011 by Vaguery
The distribution of interestingness | (R news & tutorials)
"The longer – and far less satisfying – answer to the question of how interestingness measures should be distributed is, “it depends,” as the following discussion illustrates."
statistics  interestingness  design-of-measures  statisticians-don't-do-Pragmatism-well  learning-from-data 
may 2011 by Vaguery
Growing need for data heads
"I've said it before, but if digging into data is your idea of fun, there's a whole mess of excitement and adventure headed your way. There are lots of opportunities already out there in marketing, journalism, tech, the Web, government, and pretty much everywhere you look. And more importantly, there are lots of opportunities that you can make for yourself. This is a great time for data heads."
data-science  data-mining  statistics  jobs  advice 
may 2011 by Vaguery
Friday fun projects | (R news & tutorials)
At some point, I’ll turn to my favourite web application combo: Sinatra + MongoDB + Highcharts, to visualize these data dynamically on a web page. For now though, we can get a quick idea and create even more Friday fun by learning how to use RApache to run and view R code in the browser.
Ruby  R-language  visualization  statistics  programming  learning-by-doing 
may 2011 by Vaguery
ashleyw/phrasie - GitHub
Determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.
Ruby  library  tagging  natural-language-processing  NLP  statistics  text-mining 
may 2011 by Vaguery
[1102.3220] A signal recovery algorithm for sparse matrix based compressed sensing
"Even when the numbers of non-zero entries per column/row in the measurement matrices are limited to $O(1)$, numerical experiments indicate that the algorithm can still typically recover the original signal perfectly with an $O(N)$ computational cost per update as well if the density $\rho$ of non-zero entries of the signal is lower than a certain critical value $\rho_{\rm th}(\alpha)$ as $N,M \to \infty$."
compressed-sensing  algorithms  signal-processing  nudge-targets  machine-learning  statistics  from delicious
april 2011 by Vaguery
[0807.1271] Semiparametric curve alignment and shift density estimation for biological data
"Assume that we observe a large number of curves, all of them with identical, although unknown, shape, but with a different random shift. The objective is to estimate the individual time shifts and their distribution. Such an objective appears in several biological applications like neuroscience or ECG signal processing, in which the estimation of the distribution of the elapsed time between repetitive pulses with a possibly low signal-noise ratio, and without a knowledge of the pulse shape is of interest. We suggest an M-estimator leading to a three-stage algorithm: we split our data set in blocks, on which the estimation of the shifts is done by minimizing a cost criterion based on a functional of the periodogram; the estimated shifts are then plugged into a standard density estimator. We show that under mild regularity assumptions the density estimate converges weakly to the true shift distribution. The theory is applied both to simulations and to alignment of real ECG signals.…"
data-analysis  statistics  algorithms  heuristics  exploratory-data-analysis  nudge  optimization  classification  time-series 
august 2010 by Vaguery
[1008.1414] Statistically validated networks in bipartite complex systems
"Many complex systems present an intrinsic bipartite nature and are often described and modeled in terms of networks [1-5]. Examples include movies and actors [1, 2, 4], authors and scientific papers [6-9], email accounts and emails [10], plants and animals that pollinate them [11, 12]. Bipartite networks are often very heterogeneous in the number of relationships that the elements of one set establish with the elements of the other set. … Here we introduce an unsupervised method to statistically validate each link of the projected network against a null hypothesis taking into account the heterogeneity of the system. We apply our method to three different systems…. In all these systems, both different in size and level of heterogeneity, we find that our method is able to detect network structures which are informative about the system…"
complexology  network-theory  algorithms  machine-learning  nudge-targets  inference  statistics 
august 2010 by Vaguery
[1008.1758] Stochastic Data Clustering
"In 1961 Herbert Simon and Albert Ando published the theory behind the long-term behavior of a dynamical system that can be described by a nearly completely decomposable matrix. Over the past fifty years this theory has been used in a variety of contexts, including queueing theory, computer performance, and ecology. In all these applications, the structure of the system is known and the point of interest is the various states the system passes through on its way to some long-term equilibrium. This paper looks at this problem from the other direction. That is, we develop a technique for using the evolution of the system to tell us about its initial structure, and we use this technique to develop a new algorithm for data clustering."
clustering  data-analysis  exploratory-data-analysis  statistics  algorithms 
august 2010 by Vaguery
[1007.5516] Variable importance and model selection by decorrelation
"We introduce a simple criterion, the CAR score, for ranking and selecting variables in linear regression. The CAR score arises naturally in the best predictor formulation of the linear model, offers a canonical decomposition of the proportion of explained variance, and also takes account of correlation and grouping structure among explanatory variables. As population quantity the CAR score is not tied to any specific inference paradigm. Variable selection based on AIC, $C_p$, BIC, and other information criteria is shown to be equivalent to thresholding CAR scores at a fixed level, whereas using false discovery rates corresponds to an adaptive cutoff. In computer simulations we show that CAR scores are highly effective for variable selection with a prediction error that compares favorable with the elastic net and similar regression procedures. We illustrate the approach by analyzing diabetes data as well as gene expression data from the human frontal cortex."
statistics  variable-selection  algorithms  information-theory  models  heuristics 
august 2010 by Vaguery
[0911.5460] Thresholding-based Iterative Selection Procedures for Generalized Linear Models
"High-dimensional correlated data pose challenges in model selection and predictive learning. In this paper, we derive an iterative thresholding technique for generalized linear models (GLMs) with possibly nonorthogonal designs. We propose a family of $\Theta$-estimators which are associated with penalized likelihoods and can be computed by thresholding-based iterative procedures. It can also be used to robustify GLMs and extend the canonical $M$-estimators.…"
variable-selection  statistics  models  modeling 
august 2010 by Vaguery
[1007.5510] An algorithm for the principal component analysis of large data sets
"Recently popularized randomized methods for principal component analysis (PCA) efficiently and reliably produce nearly optimal accuracy - even on parallel processors - unlike the classical (deterministic) alternatives. We adapt one of these randomized methods for use with data sets that are too large to be stored in random-access memory (RAM). (The traditional terminology is that our procedure works efficiently "out-of-core.") We illustrate the performance of the algorithm via several numerical examples. For example, we report on the PCA of a data set stored on disk that is so large that less than a hundredth of it can fit in our computer's RAM."
algorithms  big-data-will-lead-to-big-inference  statistics  data-mining  exploratory-data-analysis 
august 2010 by Vaguery
[1007.1075] Clustering Stability: An Overview
"A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications."
statistics  data-analysis  clustering  nonparametric-statistics  exploratory-data-analysis  heuristics 
august 2010 by Vaguery
[1007.3254] Distinguishing Fact from Fiction: Pattern Recognition in Texts Using Complex Networks
"We establish concrete mathematical criteria to distinguish between different kinds of written storytelling, fictional and non-fictional. Specifically, we constructed a semantic network from both novels and news stories, with $N$ independent words as vertices or nodes, and edges or links allotted to words occurring within $m$ places of a given vertex; we call $m$ the word distance. We then used measures from complex network theory to distinguish between news and fiction, studying the minimal text length needed as well as the optimized word distance $m$. The literature samples were found to be most effectively represented by their corresponding power laws over degree distribution $P(k)$ and clustering coefficient $C(k)$; we also studied the mean geodesic distance, and found all our texts were small-world networks.…"
nudge-targets  computational-linguistics  linguistics  classification  machine-learning  statistics  natural-language-processing 
august 2010 by Vaguery
[1006.5731] A Taxonomy of Networks
"The study of networks has grown into a substantial interdisciplinary endeavor across the natural, social, and information sciences. Yet there have been very few attempts to investigate the interrelatedness of the different classes of networks studied by different disciplines. Here, we introduced a framework to establish a taxonomy of networks from various origins. The provision of this family tree not only helps understand the kinship of networks, but also facilitates the transfer of empirical analysis, theoretical modeling, and conceptual developments across disciplinary boundaries. The framework is based on probing the mesoscopic properties of networks, an important source of heterogeneity for their structure and function. Using our method, we computed a taxonomy for 752 individual networks and a separate taxonomy for 12 network classes. We also computed three within-class taxonomies for political, fungal, and financial networks, and found them to be insightful in each case."
nudge-targets  classification  models  network-theory  statistics  complexology  ontology  taxonomy 
july 2010 by Vaguery
[0906.5321] Efficient statistical inference for stochastic reaction processes
"We address the problem of estimating unknown model parameters and state variables in stochastic reaction processes when only sparse and noisy measurements are available. Using an asymptotic system size expansion for the backward equation we derive an efficient approximation for this problem. We demonstrate the validity of our approach on model systems and generalize our method to the case when some state variables are not observed."
models  statistics  inference  inverse-problems  nudge-targets  dynamical-systems 
july 2010 by Vaguery
[1002.0377] Universal Laws and Economic Phenomena
Makes me want to write a simple agent-based model in which a few people have almost all the money and most everybody else is allowed to move a bit around, for a fee.

"This is a short commentary piece that discusses how the methods used in the natural sciences can apply to economics in general and financial markets specifically."
models  economics  statistics  physics-envy 
july 2010 by Vaguery
[0903.5066] Modified-CS: Modifying Compressive Sensing for Problems with Partially Known Support
"We study the problem of reconstructing a sparse signal from a limited number of its linear projections when a part of its support is known, although the known part may contain some errors. The "known" part of the support, denoted T, may be available from prior knowledge. Alternatively, in a problem of recursively reconstructing time sequences of sparse spatial signals, one may use the support estimate from the previous time instant as the "known" part. The idea of our proposed solution (modified-CS) is to solve a convex relaxation of the following problem: find the signal that satisfies the data constraint and is sparsest outside of T.…"
compressed-sensing  algorithms  machine-learning  statistics  signal-processing  nudge-targets  data-analysis 
july 2010 by Vaguery
[1007.4191] Fast Moment Estimation in Data Streams in Optimal Space
"We give a space-optimal algorithm with update time O(log^2(1/eps)loglog(1/eps)) for (1+eps)-approximating the pth frequency moment, 0 < p < 2, of a length-n vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time Omega(1/eps^2)."
nudge-targets  algorithms  data-analysis  online-learning  machine-learning  computational-complexity  statistics 
july 2010 by Vaguery
Environment for DeveLoping KDD-Applications Supported by Index-Structures - Wikipedia, the free encyclopedia
"Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) is a Knowledge Discovery in Databases (KDD, "data mining") software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures."
clustering  algorithms  libraries  data-analysis  exploratory-data-analysis  statistics  nudge 
july 2010 by Vaguery
[1004.3246] The Complexity of Finding Reset Words in Finite Automata
"We study several problems related to finding reset words in deterministic finite automata. In particular, we establish that the problem of deciding whether a shortest reset word has length k is complete for the complexity class DP. This result answers a question posed by Volkov. For the search problems of finding a shortest reset word and the length of a shortest reset word, we establish membership in the complexity classes FP^NP and FP^NP[log], respectively. Moreover, we show that both these problems are hard for FP^NP[log]. Finally, we observe that computing a reset word of a given length is FNP-complete."
finite-state-machine  statistics  computational-mechanics  modeling  optimization  computational-complexity  nudge-targets 
june 2010 by Vaguery
[1006.4968] Validation of credit default probabilities via multiple testing procedures
"We apply multiple testing procedures to the validation of estimated default probabilities in credit rating systems. The goal is to identify rating classes for which the probability of default is estimated inaccurately, while still maintaining a predefined level of committing type I errors as measured by the familywise error rate (FWER) and the false discovery rate (FDR). For FWER, we also consider procedures that take possible discreteness of the data resp. test statistics into account. The performance of these methods is illustrated in a simulation setting and for empirical default data."
finance  prediction  data-mining  models  statistics  machine-learning  nudge-targets 
june 2010 by Vaguery
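Note to self: the abstract does not say which multiple testing procedures are used, so the following is only a sketch of the two textbook ones, Bonferroni (controls FWER) and Benjamini-Hochberg (controls FDR under independence). The function names and the p-values in the usage note are made up for illustration; they are not from the paper.

```python
def bonferroni_reject(p_values, alpha):
    """Reject H_i when p_i <= alpha / m; controls the familywise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha):
    """Step-up procedure: find the largest rank k with p_(k) <= k * alpha / m
    and reject the k smallest p-values; controls the false discovery rate
    for independent tests."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject
```

With hypothetical per-rating-class p-values `[0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]` and `alpha = 0.05`, Bonferroni flags only the first class while Benjamini-Hochberg flags the first two, which is the usual pattern: FDR control is the less conservative of the two.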
[1006.5273] Linear Detrending Subsequence Matching in Time-Series Databases
"Each time-series has its own linear trend, the directionality of a time-series, and removing the linear trend is crucial to get more intuitive matching results. Supporting linear detrending in subsequence matching is a challenging problem due to the huge number of possible subsequences. In this paper we define this problem as linear detrending subsequence matching and propose an efficient index-based solution. To this end, we first present a notion of LD-windows (LD means linear detrending), which is obtained as follows: we eliminate the linear trend from a subsequence rather than each window itself and obtain LD-windows by dividing the subsequence into windows. Using the LD-windows we then present a lower bounding theorem for the index-based matching solution and formally prove its correctness.…"
time-series  data-mining  data-analysis  prediction  statistics  nudge-targets 
june 2010 by Vaguery
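Note to self: the basic operation here, removing a linear trend from one subsequence, is just an ordinary least-squares line fit against the time index followed by subtraction. A minimal sketch, assuming equally spaced samples indexed 0, 1, 2, ...; the function name is mine, and none of the paper's LD-window or indexing machinery is attempted.

```python
def linear_detrend(series):
    """Subtract the least-squares line (fitted against t = 0, 1, ..., n-1)
    from a sequence of numbers and return the residuals."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    # Closed-form simple linear regression of y on t.
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(series)]
```

Two subsequences that differ only by a straight line detrend to the same residuals, which is presumably the sense in which matching becomes "more intuitive".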
[1006.3246] Sparse approaches for the exact distribution of patterns in long multi-states sequences generated by a Markov source
"We present two novel approaches for the computation of the exact distribution of a pattern in a long sequence. Both approaches take into account the sparse structure of the problem. The first approach relies on a partial recursion computing the largest eigenvalue of the transition matrix of a Markov chain embedding. The second approach uses fast Taylor expansions of an exact bivariate rational reconstruction of the distribution. We illustrate the interest of both approaches on a simple toy-example and two biological applications: the transcription factors of the Human Chromosome 5 and the PROSITE signatures of functional motifs in proteins. On these examples our methods demonstrate their complementarity and their ability to extend the domain of feasibility for exact computations in pattern problems to a new level."
bioinformatics  nudge-targets  sequences  statistics  models  computational-mechanics  automata 
june 2010 by Vaguery
[0911.4729] Hearing the clusters in a graph: A distributed algorithm
"We propose a novel distributed algorithm to cluster graphs. The algorithm recovers the solution obtained from spectral clustering without the need for expensive eigenvalue/vector computations. We prove that, by propagating waves through the graph, a local fast Fourier transform yields the local component of every eigenvector of the Laplacian matrix, which are used to cluster graphs. For large graphs, the proposed algorithm is orders of magnitude faster than random walk based approaches. We prove the equivalence of the proposed algorithm to spectral clustering and derive convergence rates. We also demonstrate the benefit of using this decentralized clustering algorithm to accelerate distributed estimation for sensor networks and for efficient computation of distributed multi-agent search strategies."
network-theory  graph-theory  clustering  algorithms  numerical-methods  statistics  nudge-targets 
june 2010 by Vaguery
[1006.4330] Large gaps imputation in remote sensed imagery of the environment
"Imputation of missing data in large regions of satellite imagery is necessary when the acquired image has been damaged by shadows due to clouds, or information gaps produced by sensor failure.
The general approach for imputation of missing data that cannot be considered missing at random suggests the use of other available data. Previous work, like local linear histogram matching, takes advantage of a co-registered older image obtained by the same sensor, yielding good results in filling homogeneous regions, but poor results if the scenes being combined have radical differences in target radiance due, for example, to the presence of sun glint or snow.…"
nudge-targets  definitely-nudge-targets  imputation  statistics  machine-learning  data-analysis 
june 2010 by Vaguery
[1006.4354] Empirical Modeling of Radiative versus Magnetic Flux for the Sun-as-a-Star
"…We find that a well-defined temporal component exists and accounts for some of the variance in the data. This temporal component arises because active regions with high magnetic field strength evolve, breaking up into small-scale magnetic elements with low field strength, and radiative and magnetic fluxes are sensitive to different active-region components. We generate empirical models that relate radiative flux to magnetic flux, allowing us to predict spectral-irradiance variations from observations of disk-averaged magnetic-flux density. In most cases, the model reconstructions can account for 85-90% of the variability of the radiative flux from the chromosphere and corona. Our results are important for understanding the relationship between magnetic and radiative measures of solar and stellar variability."
astronomy  astrophysics  modeling  learning-from-data  statistics  nudge-targets 
june 2010 by Vaguery
[1006.3128] Fundamental Tradeoffs for Sparsity Pattern Recovery
"Recovery of the sparsity pattern (or support) of a sparse vector from a small number of noisy linear samples is a common problem that arises in signal processing and statistics. In the high dimensional setting, it is known that recovery with a vanishing fraction of errors is impossible if the sampling rate and per-sample signal-to-noise ratio (SNR) are finite constants independent of the length of the vector. In this paper, it is shown that recovery with an arbitrarily small but constant fraction of errors is, however, possible, and that in some cases a computationally simple thresholding estimator is near-optimal.…"
signal-processing  nudge-targets  information-theory  communication  numerical-methods  statistics  algorithms  approximation  heuristics 
june 2010 by Vaguery
[0902.0600] Decisional States
"…The intrinsic underlying structure of the system is modeled by an epsilon-machine and its causal states. The decisional states are the emerging patterns corresponding to the utility function. In a complex systems perspective, these patterns thus form a partition of the lower-level system states that is defined according to the higher-level user's knowledge. The transitions between these decisional states correspond to events that lead to a change of decision. An algorithm is provided so as to estimate the states and their transitions from data. Application examples are given for hidden model reconstruction, cellular automata filtering, and edge detection in images."
computational-mechanics  information-theory  prediction  statistics  probability-theory  machine-learning  classification 
june 2010 by Vaguery
[1006.1346] C-HiLasso: A Collaborative Hierarchical Sparse Modeling Framework
"Sparse modeling is a powerful framework for data analysis and processing. Traditionally, encoding in this framework is performed by solving an L1-regularized linear regression problem, commonly referred to as Lasso or Basis Pursuit. In this work we combine the sparsity-inducing property of the Lasso model at the individual feature level, with the block-sparsity property of the Group Lasso model, where sparse groups of features are jointly encoded, obtaining a sparsity pattern hierarchically structured. This results in the Hierarchical Lasso (HiLasso), which shows important practical modeling advantages.…"
numerical-methods  statistics  learning-from-data  machine-learning  image-processing  image-segmentation  nudge-targets 
june 2010 by Vaguery
[1006.1328] Uncovering the Riffled Independence Structure of Rankings
"… In this paper, we provide a formal introduction to riffled independence and present algorithms for using riffled independence within Fourier-theoretic frameworks which have been explored by a number of recent papers. Additionally, we propose an automated method for discovering sets of items which are riffle independent from a training set of rankings. We show that our clustering-like algorithms can be used to discover meaningful latent coalitions from real preference ranking datasets and to learn the structure of hierarchically decomposable models based on riffled independence."
statistics  ranking  clustering  data-envelopment-analysis  multiobjective-optimization  nudge  numerical-methods 
june 2010 by Vaguery
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and `generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hierarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful.…Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
clustering  algorithms  statistics  models  classification  learning-from-data 
june 2010 by Vaguery
[1006.3342] Local polynomial regression and variable selection
Will I ever understand all the effort statisticians put into what I consider a solved problem? Pareto-GP is apparently still utterly unknown.
statistics  models-and-modes  modeling-is-not-mathematics  algorithms  regression  variable-selection  genetic-programming-target 
june 2010 by Vaguery
[0907.5236] A Discussion on Mean Excess Plots
"A widely used tool in the study of risk, insurance and extreme values is the mean excess plot. One use is for validating a generalized Pareto model for the excess distribution. This paper investigates some theoretical and practical aspects of the use of the mean excess plot."
modeling  statistics  visualization  review  operations-research  extreme-values 
june 2010 by Vaguery
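Note to self: the empirical mean excess function at threshold u is the average of (x - u) over the observations with x > u. For a generalized Pareto distribution with shape below 1 the theoretical mean excess is linear in u, which is why a roughly straight plot is read as support for the model. A minimal sketch; the function names are mine, and plotting is left out.

```python
def mean_excess(sample, u):
    """Average amount by which observations above u exceed u."""
    excesses = [x - u for x in sample if x > u]
    if not excesses:
        raise ValueError("no observations exceed the threshold")
    return sum(excesses) / len(excesses)


def mean_excess_points(sample):
    """(threshold, mean excess) pairs, using each distinct observed value
    except the largest as a threshold; these are the points one would plot."""
    thresholds = sorted(set(sample))[:-1]
    return [(u, mean_excess(sample, u)) for u in thresholds]
```

For the toy sample `[1, 2, 3, 4, 10]`, `mean_excess(sample, 3)` averages the excesses 1 and 7 and gives 4.0. The rightmost points of the plot rest on very few observations, so they are the noisiest.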
[0812.3141] Choosing a penalty for model selection in heteroscedastic regression
"We consider the problem of choosing between several models in least-squares regression with heteroscedastic data. We prove that any penalization procedure is suboptimal when the penalty is a function of the dimension of the model, at least for some typical heteroscedastic model selection problems. In particular, Mallows' Cp is suboptimal in this framework. On the contrary, optimal model selection is possible with data-driven penalties such as resampling or $V$-fold penalties. Therefore, it is worth estimating the shape of the penalty from data, even at the price of a higher computational cost. Simulation experiments illustrate the existence of a trade-off between statistical accuracy and computational complexity. As a conclusion, we sketch some rules for choosing a penalty in least-squares regression, depending on what is known about possible variations of the noise-level."
statistics  statistical-tests  linear-regression  meta-optimization  nudge-targets  multiobjective-optimization  pragmatism-it-ain't 
june 2010 by Vaguery
[1006.2307] Exploring the randomness of Directed Acyclic Networks
"The feed-forward relationship naturally observed in time-dependent processes and in a diverse number of real systems -such as some food-webs and electronic and neural wiring- can be described in terms of so-called directed acyclic graphs (DAGs). An important ingredient of the analysis of such networks is a proper comparison of their observed architecture against an ensemble of randomized graphs, thereby quantifying the randomness of the real systems with respect to suitable null models. This approximation is particularly relevant when the finite size and/or large connectivity of real systems make inadequate a comparison with the predictions obtained from the so-called configuration model. In this paper we analyze four methods of DAG randomization as defined by the desired combination of topological invariants (directed and undirected degree sequence and component distributions) aimed to be preserved.…"
networks  network-theory  graph-theory  algorithms  statistics  complexology  theoretical-biology 
june 2010 by Vaguery
[1006.0849] Reconstruction of Causal Networks by Set Covering
"We present a method for the reconstruction of networks, based on the order of nodes visited by a stochastic branching process. Our algorithm reconstructs a network of minimal size that ensures consistency with the data. Crucially, we show that global consistency with the data can be achieved through purely local considerations, inferring the neighbourhood of each node in turn. The optimisation problem solved for each individual node can be reduced to a Set Covering Problem, which is known to be NP-hard but can be approximated well in practice. We then extend our approach to account for noisy data, based on the Minimum Description Length principle. We demonstrate our algorithms on synthetic data, generated by an SIR-like epidemiological model."
network-theory  modeling  statistics  learning-from-data  learning-by-doing  algorithms  nudge-targets 
june 2010 by Vaguery
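Note to self: the abstract does not say which Set Covering approximation is used. The standard one is the greedy heuristic (repeatedly take the subset that covers the most still-uncovered elements), so here is a sketch of that, offered as an assumption about what "approximated well in practice" could look like rather than as the paper's algorithm. The function name and toy data are mine.

```python
def greedy_set_cover(universe, subsets):
    """Return indices of subsets chosen greedily until the universe is covered.

    `subsets` is a list of sets. Ties are broken in favor of the lowest index.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset that covers the most elements still uncovered.
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        if not uncovered & subsets[best]:
            raise ValueError("subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen
```

For example, covering `{1, 2, 3, 4, 5}` with `[{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]` picks subset 0 first and then subset 3.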
[1006.0764] General Purpose Convolution Algorithm in S4-Classes by means of FFT
"Object orientation provides a flexible framework for the implementation of the convolution of arbitrary distributions of real-valued random variables.
We discuss an algorithm which is based on the Discrete Fourier Transformation and its fast computability via the Fast Fourier Transformation. It directly applies to lattice-supported distributions. In the case of continuous distributions an additional discretization to a linear lattice is necessary and the resulting lattice-supported distributions are suitably smoothed after convolution."
statistics  R  library  probability-theory  libraries  open-source  nudge 
june 2010 by Vaguery
What is data science? - O'Reilly Radar
"We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?

In this post, I examine the many sides of data science -- the technologies, the companies and the unique skill sets."
data-analysis  data-mining  learning-from-data  statistics  futurism  drinking-from-the-firehose  nudge  via:tsuomela 
june 2010 by Vaguery
[0908.2503] Sequential Quantile Prediction of Time Series
"Motivated by a broad range of potential applications, we address the quantile prediction problem of real-valued time series. We present a sequential quantile forecasting model based on the combination of a set of elementary nearest neighbor-type predictors called "experts" and show its consistency under a minimum of conditions. Our approach builds on the methodology developed in recent years for prediction of individual sequences and exploits the quantile structure as a minimizer of the so-called pinball loss function. We perform an in-depth analysis of real-world data sets and show that this nonparametric strategy generally outperforms standard quantile prediction methods."
time-series  prediction  models  statistics  nudge-targets  learning-from-data  machine-learning 
june 2010 by Vaguery
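Note to self: the "quantile as a minimizer of the pinball loss" remark is easy to check numerically. The pinball loss at level tau charges tau per unit of under-prediction and (1 - tau) per unit of over-prediction, and the value minimizing its average over a sample is an empirical tau-quantile. A minimal sketch with names of my own; it has nothing of the paper's expert-combination scheme.

```python
def pinball_loss(y, q, tau):
    """Loss of predicting q for outcome y at quantile level tau in (0, 1)."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1) * diff


def quantile_by_pinball(sample, tau):
    """Sample point with the smallest average pinball loss
    (the smallest such point if several tie)."""
    def average_loss(q):
        return sum(pinball_loss(y, q, tau) for y in sample) / len(sample)
    return min(sorted(sample), key=average_loss)
```

On the sample 1, 2, ..., 9 this returns 5 for `tau = 0.5` (the median) and 3 for `tau = 0.25`.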
[1005.4358] On the estimation of the extremal index based on scaling and resampling
"The extremal index parameter theta characterizes the degree of local dependence in the extremes of a stationary time series and has important applications in a number of areas, such as hydrology, telecommunications, finance and environmental studies.…Further, a procedure for the automatic selection of its tuning parameter is developed and different types of confidence intervals that prove useful in practice proposed. The performance of the estimator is examined through simulations, which show its highly competitive behavior. Finally, the estimator is applied to three real data sets of daily crude oil prices, daily returns of the S&P 500 stock index, and high-frequency, intra-day traded volumes of a stock. These applications demonstrate additional diagnostic features of statistical plots based on the new estimator."
statistics  time-series  statistical-tests  nudge-targets  algorithms  extreme-values 
may 2010 by Vaguery
[1005.4274] This is SPIRAL-TAP: Sparse Poisson Intensity Reconstruction ALgorithms - Theory and Practice
"The optimization formulation considered in this paper uses a penalized negative Poisson log-likelihood objective function with nonnegativity constraints (since Poisson intensities are naturally nonnegative). In particular, the proposed approach incorporates key ideas of using separable quadratic approximations to the objective function at each iteration and penalization terms related to l1 norms of coefficient vectors, total variation seminorms, and partition-based multiscale estimation methods."
optimization  models  statistics  algorithms  image-processing  image-analysis  umlauts 
may 2010 by Vaguery
[1005.3680] Quantifying long-range correlations in complex networks beyond nearest neighbors
"We propose a fluctuation analysis to quantify spatial correlations in complex networks. The approach considers the sequences of degrees along shortest paths in the networks and quantifies the fluctuations in analogy to time series. In this work, the Barabasi-Albert (BA) model, the Cayley tree at the percolation transition, a fractal network model, and examples of real-world networks are studied. While the fluctuation functions for the BA model show exponential decay, in the case of the Cayley tree and the fractal network model the fluctuation functions display a power-law behavior. The fractal network model comprises long-range anti-correlations. The results suggest that the fluctuation exponent provides complementary information to the fractal dimension."
complexology  network-theory  physics  statistics 
may 2010 by Vaguery
Random matrices in the news : Applied Statistics
"Now, to return to the news article. If the eigenvalue distribution is an attractor, this means that a lot of physical and social phenomena which can be modeled by eigenvalues (including, apparently, quantum energy levels and some properties of statistical tests) might have a common structure. Just as, at a similar level, we see the normal distribution and related functions in all sorts of unusual places."
random-matrix  statistics  complexology  physics  applied-mathematics  universality 
may 2010 by Vaguery
[1005.2715] On the Subspace of Image Gradient Orientations
"We introduce the notion of Principal Component Analysis (PCA) of image gradient orientations. As image data is typically noisy, but noise is substantially different from Gaussian, traditional PCA of pixel intensities very often fails to estimate reliably the low-dimensional subspace of a given data population. We show that replacing intensities with gradient orientations and the $\ell_2$ norm with a cosine-based distance measure offers, to some extent, a remedy to this problem.…"
image-processing  signal-processing  image-analysis  machine-learning  statistics  PCA  nudge-targets 
may 2010 by Vaguery
[1005.2979] Robust and Adaptive Algorithms for Online Portfolio Selection
"… Our methods use simple ideas from signal processing and statistics, which are sometimes overlooked in the empirical financial literature. The two approaches are evaluated against benchmark allocation techniques using 4 real datasets. Our methods outperform the benchmark allocation techniques in these datasets, in terms of both computational demand and financial performance."
trading  financial-engineering  stocks  machine-learning  statistics  algorithms  portfolio-theory 
may 2010 by Vaguery
[0906.4779] Minimum Probability Flow Learning
"Learning in probabilistic models is often hampered by the general intractability of the normalization factor and its derivatives. Here we propose a new learning technique that obviates the need to compute an intractable normalization factor or sample from the equilibrium distribution of the model. This is achieved by establishing dynamics that would transform the observed data distribution into the model distribution, and then setting as the objective the minimization of the initial flow of probability away from the data distribution.…"
learning-from-data  statistics  machine-learning  estimation  algorithms  to-understand 
may 2010 by Vaguery
[0905.0917] Determining interaction rules in animal swarms
"In this paper we introduce a method for determining local interaction rules in animal swarms. The method is based on the assumption that the behavior of individuals in a swarm can be treated as a set of mechanistic rules.
The principal idea behind the technique is to vary parameters that define a set of hypothetical interactions to minimize the deviation between the forces estimated from observed animal trajectories and the forces resulting from the assumed rule set. We demonstrate the method by reconstructing the interaction rules from the trajectories produced by a computer simulation."
inverse-problems  agent-based  boids  nudge-targets  statistics  model-discovery 
may 2010 by Vaguery
[0912.1567] Quantifying the Ease of Scientific Discovery
"It has long been known that scientific output proceeds on an exponential increase, or more properly, a logistic growth curve. The interplay between effort and discovery is clear, and the nature of the functional form has been thought to be due to many changes in the scientific process over time. Here I show a quantitative method for examining the ease of scientific progress, another necessary component in understanding scientific discovery. Using examples from three different scientific disciplines - mammalian species, chemical elements, and minor planets - I find the ease of discovery to conform to an exponential decay. In addition, I show how the pace of scientific discovery can be best understood as the outcome of both scientific output and ease of discovery."
science  arrival-times  statistics  innovation  empirical-economics  applicable-to-genetic-programming  metering 
may 2010 by Vaguery
[1005.0182] A Multi Agent Model for the Limit Order Book Dynamics
"In the present work we introduce a novel multi-agent model with the aim to reproduce the dynamics of a double auction market at microscopic time scale through a faithful simulation of the matching mechanics in the limit order book. The model follows a "zero intelligence" approach where the actions of the traders are related to a stochastic variable, the market sentiment, which we define as a mixture of public and private information. The model, despite the parsimonious approach, is able to reproduce several empirical features of the high-frequency dynamics of the market microstructure not only related to the price movements but also to the deposition of the orders in the book."
modeling  agent-based  finance  markets  simulation  algorithms  statistics 
may 2010 by Vaguery
[1005.2197] Scalable Tensor Factorizations for Incomplete Data
"Our numerical studies suggest that the proposed CP-WOPT approach is accurate and scalable. CP-WOPT can recover the underlying factors successfully with large amounts of missing data, e.g., 90% missing entries for tensors of size 50 × 40 × 30. We have also studied how CP-WOPT can scale to problems of larger sizes, e.g., 1000 × 1000 × 1000, and recover CP factors from large, sparse tensors with 99.5% missing data.…"
statistics  numerical-methods  missing-data  scientific-computing  algorithms 
may 2010 by Vaguery
[1005.2314] Some comments on C. S. Wallace's random number generators
"Although care needs to be taken in the implementation of normal random number generators like fastnorm, and the end-user should be aware of the small but unavoidable defects discussed in §§5.6-5.7, these generators have such a performance advantage over more conventional generators that they cannot be ignored in applications where the speed of generation of pseudo-random numbers is critical."
nudge-targets  pseudorandom-numbers  algorithms  statistics  computer-science  numerical-methods 
may 2010 by Vaguery
[1005.0660] The Significant Digit Law in Statistical Physics
"The occurrence of the nonzero leftmost digit, i.e., 1, 2, ..., 9, of numbers from many real world sources is not uniformly distributed as one might naively expect, but instead, the nature favors smaller ones according to a logarithmic distribution, named Benford's law. We investigate three kinds of widely used physical statistics, i.e., the Boltzmann-Gibbs (BG) distribution, the Fermi-Dirac (FD) distribution, and the Bose-Einstein (BE) distribution, and find that the BG and FD distributions both fluctuate slightly in a periodic manner around the Benford distribution with respect to the temperature of the system, while the BE distribution conforms to it exactly whatever the temperature is. Thus the Benford's law seems to present a general pattern for physical statistics and might be even more fundamental and profound in nature. Furthermore, various elegant properties of Benford's law, especially the mantissa distribution of data sets, are discussed."
Benford's-law  mysteries-of-the-universe  number-theory  statistics  WTF 
may 2010 by Vaguery
[1005.1327] Statistical Model Checking : An Overview
"Quantitative properties of stochastic systems are usually specified in logics that allow one to compare the measure of executions satisfying certain temporal properties with thresholds. The model checking problem for stochastic systems with respect to such logics is typically solved by a numerical approach that iteratively computes (or approximates) the exact measure of paths satisfying relevant subformulas; the algorithms themselves depend on the class of systems being analyzed as well as the logic used for specifying the properties. Another approach to solve the model checking problem is to simulate the system for finitely many runs, and use hypothesis testing to infer whether the samples provide a statistical evidence for the satisfaction or violation of the specification. In this short paper, we survey the statistical approach, and outline its main advantages in terms of efficiency, uniformity, and simplicity."
complexology  simulation  statistics  models  modeling-is-not-mathematics  inference  explanatory-power 
may 2010 by Vaguery

