Vaguery + learning-from-data   64

[1111.3304] Eigenvector Synchronization, Graph Rigidity and the Molecule Problem
"The graph realization problem has received a great deal of attention in recent years, due to its importance in applications such as wireless sensor networks and structural biology.…"
algorithms  statistics  structure  learning-from-data  nudge-targets 
11 weeks ago by Vaguery
[1201.5568] Dynamic trees for streaming and massive data contexts
"Data collection at a massive scale is becoming ubiquitous in a wide variety of settings, from vast offline databases to streaming real-time information. Learning algorithms deployed in such contexts must rely on single-pass inference, where the data history is never revisited. In streaming contexts, learning must also be temporally adaptive to remain up-to-date against unforeseen changes in the data generating mechanism. Although rapidly growing, the online Bayesian inference literature remains challenged by massive data and transient, evolving data streams. Non-parametric modelling techniques can prove particularly ill-suited, as the complexity of the model is allowed to increase with the sample size. In this work, we take steps to overcome these challenges by porting standard streaming techniques, like data discarding and downweighting, into a fully Bayesian framework via the use of informative priors and active learning heuristics. We showcase our methods by augmenting a modern non-parametric modelling framework, dynamic trees, and illustrate its performance on a number of practical examples. The end product is a powerful streaming regression and classification tool, whose performance compares favourably to the state-of-the-art."
data-analysis  learning-from-data  algorithms  drinking-from-the-firehose  nudge  data-mining 
january 2012 by Vaguery
[1109.2618] Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning
We introduce a machine learning model to predict atomization energies of a diverse set of organic molecules, based on nuclear charges and atomic positions only. The problem of solving the molecular Schrödinger equation is mapped onto a non-linear statistical regression problem of reduced complexity. Regression models are trained on and compared to atomization energies computed with hybrid density-functional theory. Cross-validation over more than seven thousand small organic molecules yields a mean absolute error of ~10 kcal/mol. Applicability is demonstrated for the prediction of molecular atomization potential energy curves.
machine-learning  learning-from-data  biochemistry  computational-science  nudge-targets 
january 2012 by Vaguery
[1109.3248] Reconstruction of sequential data with density models
We introduce the problem of reconstructing a sequence of multidimensional real vectors where some of the data are missing. This problem contains regression and mapping inversion as particular cases where the pattern of missing data is independent of the sequence index. The problem is hard because it involves possibly multivalued mappings at each vector in the sequence, where the missing variables can take more than one value given the present variables; and the set of missing variables can vary from one vector to the next. To solve this problem, we propose an algorithm based on two redundancy assumptions: vector redundancy (the data live in a low-dimensional manifold), so that the present variables constrain the missing ones; and sequence redundancy (e.g. continuity), so that consecutive vectors constrain each other. We capture the low-dimensional nature of the data in a probabilistic way with a joint density model, here the generative topographic mapping, which results in a Gaussian mixture. Candidate reconstructions at each vector are obtained as all the modes of the conditional distribution of missing variables given present variables. The reconstructed sequence is obtained by minimising a global constraint, here the sequence length, by dynamic programming. We present experimental results for a toy problem and for inverse kinematics of a robot arm.
inverse-problems  statistics  algorithms  learning-from-data  nudge-targets 
january 2012 by Vaguery
[1112.5794] BATMAN-an R package for the automated quantification of metabolites from NMR spectra using a Bayesian Model
Motivation: NMR spectra are widely used in metabolomics to obtain metabolite profiles in complex biological mixtures. Common methods used to assign and estimate concentrations of metabolites involve either an expert manual peak fitting or extra pre-processing steps, such as peak alignment and binning. Peak fitting is very time consuming and is subject to human error. Conversely, alignment and binning can introduce artifacts and limit immediate biological interpretation of models. Results: We present the Bayesian AuTomated Metabolite Analyser for NMR spectra (BATMAN), an R package which deconvolves peaks from 1-dimensional NMR spectra, automatically assigns them to specific metabolites and obtains concentration estimates. The Bayesian model incorporates information on characteristic peak patterns of metabolites and is able to account for shifts in the position of peaks commonly seen in NMR spectra of biological samples. It applies a Markov Chain Monte Carlo (MCMC) algorithm to sample from a joint posterior distribution of the model parameters and obtains concentration estimates with reduced mean estimation error compared with conventional numerical integration methods.
learning-from-data  statistics  modeling  biochemistry  nudge-targets  image-segmentation 
january 2012 by Vaguery
[1105.2584] Workload Classification & Software Energy Measurement for Efficient Scheduling on Private Cloud Platforms
"At present there are a number of barriers to creating an energy efficient workload scheduler for a Private Cloud based data center. Firstly, the relationship between different workloads and power consumption must be investigated. Secondly, current hardware-based solutions to providing energy usage statistics are unsuitable in warehouse scale data centers where low cost and scalability are desirable properties. In this paper we discuss the effect of different workloads on server power consumption in a Private Cloud platform. We display a noticeable difference in energy consumption when servers are given tasks that dominate various resources (CPU, Memory, Hard Disk and Network). We then use this insight to develop CloudMonitor, a software utility that is capable of >95% accurate power predictions from monitoring resource consumption of workloads, after a "training phase" in which a dynamic power model is developed."
operations-research  cloud-computing  system-administration  learning-from-data  nudge-targets 
october 2011 by Vaguery
[1107.0674] "Memory foam" approach to unsupervised learning
"We propose an alternative approach to construct an artificial learning system, which naturally learns in an unsupervised manner. Its mathematical prototype is a dynamical system, which automatically shapes its vector field in response to the input signal. The vector field converges to a gradient of a multi-dimensional probability density distribution of the input process, taken with negative sign. The most probable patterns are represented by the stable fixed points, whose basins of attraction are formed automatically. The performance of this system is illustrated with musical signals."
machine-learning  classification  learning-from-data  algorithms  nudge-targets 
august 2011 by Vaguery
[1107.0550] 3D Terrestrial LiDAR data classification of complex natural scenes using a multi-scale dimensionality criterion: applications in geomorphology
"3D point clouds of natural environments relevant to geomorphology problems (rivers, cliffs...) often require to classify the data into elementary relevant classes. A typical example is the separation of riparian vegetation from soil in fluvial environments, the distinction between fresh surfaces and rockfall in cliff environments, or more generally the classification of surfaces according to their morphology (ripples, grain size...). Natural surfaces are very heterogeneous and their distinctive properties are seldom defined at a unique scale. We have thus defined a multi-scale measure of the point cloud dimensionality around each point. The dimensionality characterizes the local 3D organization of the point cloud and varies from being 1D (points set along a line) to really taking all 3D volume, at each scale. We present the technique and illustrate its efficiency in separating riparian vegetation from ground and classifying a mountain stream in vegetation, rock, gravel and water surface. The superiority of the multi-scale analysis in enhancing class separability and spatial resolution of the classification is also demonstrated. Large scenes can be classified on a commodity laptop in a reasonable time. The technique is robust to missing data and especially shadow zones. The classification is fast and accurate and can account for some degree of intra-class morphological variability such as different vegetation types. A probabilistic confidence in the classification result is given at each point allowing the user to remove the points for which the classification is uncertain. The process can be both fully automated but also fully customized by the user including a graphical definition of the classifiers if so desired. Although developed for fully 3D data, the method can be readily applied to 2.5D airborne LiDAR data."
image-analysis  image-segmentation  learning-from-data  classification  nudge-targets 
august 2011 by Vaguery
Language Log » Straw men and Bee Science
"Let me start by saying that there's a way to take all this that makes it entirely correct. The key motive of science is explanation, and it's often essential to abstract away from the complexities of raw observation, and so on. I took courses from Chomsky as an undergraduate and a graduate student, and I'm grateful for what I learned from him, and for the eminently fair way that he always treated me. But increasingly, it seems to me, he has been elevating his personal distaste for the complexities of the real world into a systematic philosophy. To the extent that others accept these views, it excludes them from participation in (what I think are) the most promising and exciting current directions in the sciences of speech and language."
Noam-Chomsky  theory-and-practice-sitting-in-a-tree  bias  science  learning-from-data 
june 2011 by Vaguery
Falkenblog: High Frequency Trading Paper
"The point is that in fast moving markets, one needs something a little better than simple historical moving averages of daily closing prices. This is better, and extending the idea of 'volume time' vs. 'chronological time' is an intriguing direction. But one can also look at bid-ask spreads directly, or the VIX futures, or its etf, the VXX, and combinations, to gauge intraday volatility as well. Further, one can better estimate 'buy volume' using the transaction price relative to the then extant bid-ask spread, rather than if the price was weakly increasing, though this then involves syncing the trade information with quote information, and for academics such data are often hard to come by (further, quote information is often 10 times as large)."
learning-from-data  financial-engineering  trading  analytics  nudge-targets 
june 2011 by Vaguery
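Note on the Falkenblog excerpt above: one common reading of "using the transaction price relative to the then extant bid-ask spread" is the midpoint (quote) rule, where a trade above the prevailing mid-quote counts as buyer-initiated. The sketch below assumes that reading; the function names and the tuple layout are hypothetical, and it presumes trades have already been synced with quotes, which the excerpt notes is the hard part.

```python
def classify_trade(price, bid, ask):
    """Label a trade by comparing its price with the prevailing mid-quote.

    Assumption: this is the midpoint ("quote") rule, not necessarily the
    exact method the blog post or the paper has in mind.
    """
    mid = (bid + ask) / 2.0
    if price > mid:
        return "buy"
    if price < mid:
        return "sell"
    return "unknown"  # trades exactly at the midpoint are left unclassified


def buy_volume(trades):
    """Sum the size of trades labelled 'buy'.

    Each trade is a hypothetical (price, size, bid, ask) tuple, with the
    quote already matched to the trade's timestamp.
    """
    return sum(
        size
        for price, size, bid, ask in trades
        if classify_trade(price, bid, ask) == "buy"
    )
```

A tick-based rule ("was the price weakly increasing?") needs only the trade tape; this version needs the quote at the time of each trade, which is where the data-size problem mentioned in the excerpt comes from.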
The distribution of interestingness | (R news & tutorials)
"The longer – and far less satisfying – answer to the question of how interestingness measures should be distributed is, “it depends,” as the following discussion illustrates."
statistics  interestingness  design-of-measures  statisticians-don't-do-Pragmatism-well  learning-from-data 
may 2011 by Vaguery
Evolved Analytics' DataModeler | Evolved Analytics
The technology has been developed to withstand the challenges of real world — in addition to handling problems of too much data, too little data, correlated data, or noisy data, DataModeler respects the cost and timeliness issues associated with modeling development.
evolutionary-algorithms  genetic-programming  learning-from-data  Mathematica 
may 2011 by Vaguery
[1008.1663] Learning Residual Finite-State Automata Using Observation Tables
"We define a two-step learner for RFSAs based on an observation table by using an algorithm for minimal DFAs to build a table for the reversal of the language in question and showing that we can derive the minimal RFSA from it after some simple modifications. We compare the algorithm to two other table-based ones of which one (by Bollig et al. 2009) infers a RFSA directly, and the other is another two-step learner proposed by the author. We focus on the criterion of query complexity."
finite-state-machine  machine-learning  algorithms  nudge-targets  learning-from-data  inference 
august 2010 by Vaguery
[1003.0470] Unsupervised Supervised Learning II: Training Margin Based Classifiers without Labels
"On a more philosophical level, our approach points at novel questions that go beyond supervised and semi-supervised learning. What benefit do labels provide over unsupervised training? Can our framework be extended to semi-supervised learning where a few labels do exist? Can it be extended to non-classification scenarios such as margin based regression or margin based structured prediction? When are the assumptions likely to hold and how can we make our framework even more resistant to deviations from them? These questions and others form new and exciting open research directions."
unsupervised-learning  supervised-learning  learning-from-data  machine-learning  regression  modeling 
august 2010 by Vaguery
[0912.4473] Learning to Predict Combinatorial Structures
"The major challenge in designing a discriminative learning algorithm for predicting structured data is to address the computational issues arising from the exponential size of the output space. Existing algorithms make different assumptions to ensure efficient, polynomial time estimation of model parameters. For several combinatorial structures, including cycles, partially ordered sets, permutations and other graph classes, these assumptions do not hold. In this thesis, we address the problem of designing learning algorithms for predicting combinatorial structures by introducing two new assumptions: (i) The first assumption is that a particular counting problem can be solved efficiently. The consequence is a generalisation of the classical ridge regression for structured prediction. (ii) The second assumption is that a particular sampling problem can be solved efficiently. …"
machine-learning  prediction  combinatorics  nudge-targets  learning-from-data 
june 2010 by Vaguery
[1006.4354] Empirical Modeling of Radiative versus Magnetic Flux for the Sun-as-a-Star
"…We find that a well-defined temporal component exists and accounts for some of the variance in the data. This temporal component arises because active regions with high magnetic field strength evolve, breaking up into small-scale magnetic elements with low field strength, and radiative and magnetic fluxes are sensitive to different active-region components. We generate empirical models that relate radiative flux to magnetic flux, allowing us to predict spectral-irradiance variations from observations of disk-averaged magnetic-flux density. In most cases, the model reconstructions can account for 85-90% of the variability of the radiative flux from the chromosphere and corona. Our results are important for understanding the relationship between magnetic and radiative measures of solar and stellar variability."
astronomy  astrophysics  modeling  learning-from-data  statistics  nudge-targets 
june 2010 by Vaguery
The Berkeley Segmentation Dataset and Benchmark
"The goal of this work is to provide an empirical basis for research on image segmentation and boundary detection. To this end, we have collected 12,000 hand-labeled segmentations of 1,000 Corel dataset images from 30 human subjects. Half of the segmentations were obtained from presenting the subject with a color image; the other half from presenting a grayscale image. The public benchmark based on this data consists of all of the grayscale and color segmentations for 300 images. The images are divided into a training set of 200 images, and a test set of 100 images."
dataset  learning-from-data  training-set  machine-learning  image-segmentation  image-processing  nudge 
june 2010 by Vaguery
A Peek Into the Future: HFT and Financial News -- Seeking Alpha
"A still more realistic and subtle, but much more troublesome scenario: Financial Undetectable Journalistic Engineering (FUJE). Financial news journalists could word the reports differently and send very different signals to the robot army. Here're two actual news headlines re. the May NFP number (incidentally, both are from the same outlet, same day, different reporter -- just a random google search):

US adds 431,000 jobs in May, unemployment down to 9.7 pct
vs.

Despite Adding 431K Jobs, May Non-Farm Payroll Figures Disappoint
The first is factual; the second contains more in-depth analysis. It takes an experienced human to parse and reconcile the two. You can see how robot readers may assign opposite signs to each."
data-mining  high-frequency-trading  trading  news  learning-from-data  boy-am-I-glad-we-folded-the-startup 
june 2010 by Vaguery
[1006.1346] C-HiLasso: A Collaborative Hierarchical Sparse Modeling Framework
"Sparse modeling is a powerful framework for data analysis and processing. Traditionally, encoding in this framework is performed by solving an L1-regularized linear regression problem, commonly referred to as Lasso or Basis Pursuit. In this work we combine the sparsity-inducing property of the Lasso model at the individual feature level, with the block-sparsity property of the Group Lasso model, where sparse groups of features are jointly encoded, obtaining a sparsity pattern hierarchically structured. This results in the Hierarchical Lasso (HiLasso), which shows important practical modeling advantages.…"
numerical-methods  statistics  learning-from-data  machine-learning  image-processing  image-segmentation  nudge-targets 
june 2010 by Vaguery
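Note on the C-HiLasso excerpt above: the plain Lasso it starts from, an L1-regularized least-squares problem, can be sketched with cyclic coordinate descent and soft-thresholding. This is only the basic Lasso, not the hierarchical or collaborative variants the paper proposes; the function names and the exact scaling of the objective are assumptions made for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimise 0.5 * ||y - X w||^2 + lam * ||w||_1 by cyclic coordinate descent.

    X is a list of rows (lists of floats), y a list of floats.
    Pure-Python sketch for small problems, written for clarity, not speed.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho = 0.0      # correlation of feature j with the partial residual
            norm_sq = 0.0  # squared norm of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm_sq += X[i][j] ** 2
            w[j] = soft_threshold(rho, lam) / norm_sq if norm_sq > 0 else 0.0
    return w
```

The soft-threshold step is what produces exact zeros in the coefficient vector, which is the feature-level sparsity the excerpt refers to.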
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and `generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hieirarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful.…Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
clustering  algorithms  statistics  models  classification  learning-from-data 
june 2010 by Vaguery
[1005.5636] Astrocladistics: Multivariate Evolutionary Analysis in Astrophysics
"It is now clear that cladistics can be applied and be useful to the study of galaxy diversification. Many difficulties, conceptual and practical, have been solved,. Significant astrophysical results have been obtained and will be extended to larger samples of galaxies and globular clusters. However, many paths remain in the exploration of this new and large field of research."
astronomy  classification  cladistics  inference  nudge-targets  learning-from-data  model-discovery 
june 2010 by Vaguery
[1006.0849] Reconstruction of Causal Networks by Set Covering
"We present a method for the reconstruction of networks, based on the order of nodes visited by a stochastic branching process. Our algorithm reconstructs a network of minimal size that ensures consistency with the data. Crucially, we show that global consistency with the data can be achieved through purely local considerations, inferring the neighbourhood of each node in turn. The optimisation problem solved for each individual node can be reduced to a Set Covering Problem, which is known to be NP-hard but can be approximated well in practice. We then extend our approach to account for noisy data, based on the Minimum Description Length principle. We demonstrate our algorithms on synthetic data, generated by an SIR-like epidemiological model."
network-theory  modeling  statistics  learning-from-data  learning-by-doing  algorithms  nudge-targets 
june 2010 by Vaguery
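Note on the Set Covering excerpt above: the abstract says the per-node problem is NP-hard "but can be approximated well in practice" without naming a method in the quoted text. A standard choice is the greedy heuristic, sketched below as an assumption; it is not claimed to be the authors' algorithm, and the function name and input layout are hypothetical.

```python
def greedy_set_cover(universe, subsets):
    """Greedy approximation to Set Cover.

    universe: iterable of elements to cover.
    subsets:  list of sets; returns the indices of the chosen subsets.
    At each step, pick the subset covering the most still-uncovered elements.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        if not uncovered & subsets[best]:
            raise ValueError("subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen
```

For example, with universe {1, 2, 3, 4, 5} and subsets [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}], the first pick is subset 0 (three new elements) and the second is subset 3 (the remaining two).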
What is data science? - O'Reilly Radar
"We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?

In this post, I examine the many sides of data science -- the technologies, the companies and the unique skill sets."
data-analysis  data-mining  learning-from-data  statistics  futurism  drinking-from-the-firehose  nudge  via:tsuomela 
june 2010 by Vaguery
[1004.3925] Classification using distance nearest neighbours
"This paper proposes a new probabilistic classification algorithm using a Markov random field approach. The joint distribution of class labels is explicitly modelled using the distances between feature vectors. Intuitively, a class label should depend more on class labels which are closer in the feature space, than those which are further away.…"
classification  machine-learning  markov-random-field  algorithms  learning-from-data 
june 2010 by Vaguery
[0908.2503] Sequential Quantile Prediction of Time Series
"Motivated by a broad range of potential applications, we address the quantile prediction problem of real-valued time series. We present a sequential quantile forecasting model based on the combination of a set of elementary nearest neighbor-type predictors called "experts" and show its consistency under a minimum of conditions. Our approach builds on the methodology developed in recent years for prediction of individual sequences and exploits the quantile structure as a minimizer of the so-called pinball loss function. We perform an in-depth analysis of real-world data sets and show that this nonparametric strategy generally outperforms standard quantile prediction methods"
time-series  prediction  models  statistics  nudge-targets  learning-from-data  machine-learning 
june 2010 by Vaguery
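Note on the Sequential Quantile Prediction excerpt above: the "pinball loss" it mentions penalises under- and over-prediction asymmetrically, so that the value minimising the average loss is a quantile of the data. A minimal sketch follows; the function names are hypothetical, and the brute-force search over observed values is for illustration only, not the paper's expert-combination method.

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss for one observation at quantile level tau in (0, 1).

    Under-prediction (y_true > y_pred) is weighted by tau,
    over-prediction by (1 - tau).
    """
    diff = y_true - y_pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def empirical_quantile_by_loss(values, tau):
    """Return the observed value that minimises the mean pinball loss."""
    def mean_loss(q):
        return sum(pinball_loss(v, q, tau) for v in values) / len(values)
    return min(values, key=mean_loss)
```

With tau = 0.5 the loss is half the absolute error, so the minimiser over [1, 2, 3, 4, 5] is the median, 3; raising tau to 0.7 moves the minimiser up to 4.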
Getting Started Guide - Google Prediction API - Google Code
"The Prediction API allows you to get more from your data and makes its patterns more accessible. Specifically, the Prediction API leverages Google's machine learning infrastructure to give you the tools to better analyze your data and reveal patterns that are often difficult to manually discover. The API also enables you to use those patterns to predict new outcomes, which facilitates the development of all types of software, from textual analysis systems to recommendation systems. Because the Prediction API is a RESTful HTTP service, you can easily access it from Google App Engine, Apps Script, and other Internet-connected desktop applications."
nudge  machine-learning  models  google  prediction  clustering  learning-from-data  AI  API  open-science 
may 2010 by Vaguery
[0906.4779] Minimum Probability Flow Learning
"Learning in probabilistic models is often hampered by the general intractability of the normalization factor and its derivatives. Here we propose a new learning technique that obviates the need to compute an intractable normalization factor or sample from the equilibrium distribution of the model. This is achieved by establishing dynamics that would transform the observed data distribution into the model distribution, and then setting as the objective the minimization of the initial flow of probability away from the data distribution.…"
learning-from-data  statistics  machine-learning  estimation  algorithms  to-understand 
may 2010 by Vaguery
Lee Byron » Else » Stream Graph Paper
"In February 2008, the New York Times published an unusual chart of box office revenues for 7500 movies over 21 years. The chart was based on a similar visualization, developed by the first author, that displayed trends in music listening. This paper describes the design decisions and algorithms behind these graphics, and discusses the reaction on the Web. We suggest that this type of complex layered graph is effective for displaying large data sets to a mass audience. We provide a mathematical analysis of how this layered graph relates to traditional stacked graphs and to techniques such as ThemeRiver, showing how each method is optimizing a different “energy function”. Finally, we discuss techniques for coloring and ordering the layers of such graphs. Throughout the paper, we emphasize the interplay between considerations of aesthetics and legibility."
visualization  dataviz  data-analysis  time-series  learning-from-data  answer-factory 
may 2010 by Vaguery
[0911.2651] Optimal map of the modular structure of complex networks
"…Generally speaking, modules are islands of highly connected nodes separated by a relatively small number of links. Every module can have contributions of links from any node in the network. The challenge is to disentangle these contributions to understand how the modular structure is built. The main problem is that the analysis of a certain partition into modules involves, in principle, as many data as number of modules times number of nodes. To confront this challenge, here we first define the contribution matrix, the mathematical object containing all the information about the partition of interest, and after, we use a Truncated Singular Value Decomposition to extract the best representation of this matrix in a plane. The analysis of this projection allow us to scrutinize the skeleton of the modular structure, revealing the structure of individual modules and their interrelations."
network-thinking  complexology  inference  modeling-is-not-mathematics  learning-from-data 
may 2010 by Vaguery
[1005.0390] Machine Learning for Galaxy Morphology Classification
"In this work, decision tree learning algorithms and fuzzy inferencing systems are applied for galaxy morphology classification. In particular, the CART, the C4.5, the Random Forest and fuzzy logic algorithms are studied and reliable classifiers are developed to distinguish between spiral galaxies, elliptical galaxies or star/unknown galactic objects. Morphology information for the training and testing datasets is obtained from the Galaxy Zoo project while the corresponding photometric and spectra parameters are downloaded from the SDSS DR7 catalogue."
nudge-targets  learning-from-data  machine-learning  crowdsourcing  galaxy-zoo  public-data  datasets 
may 2010 by Vaguery
[1005.0919] Attribute Weighting with Adaptive NBTree for Reducing False Positives in Intrusion Detection
"… Due to the tremendous growth of network-based services, intrusion detection has emerged as an important technique for network security. Recently data mining algorithms are applied on network-based traffic data and host-based program behaviors to detect intrusions or misuse patterns, but there exist some issues in current intrusion detection algorithms such as unbalanced detection rates, large numbers of false positives, and redundant attributes that will lead to the complexity of detection model and degradation of detection accuracy. The purpose of this study is to identify important input attributes for building an intrusion detection system (IDS) that is computationally efficient and effective.…"
nudge-targets  system-administration  security  algorithms  machine-learning  learning-from-data  learning-by-watching  statistics 
may 2010 by Vaguery
[1005.0972] Adaptive Tuning Algorithm for Performance tuning of Database Management System
"Performance tuning of Database Management Systems(DBMS) is both complex and challenging as it involves identifying and altering several key performance tuning parameters. The quality of tuning and the extent of performance enhancement achieved greatly depends on the skill and experience of the Database Administrator (DBA). As neural networks have the ability to adapt to dynamically changing inputs and also their ability to learn makes them ideal candidates for employing them for tuning purpose. In this paper, a novel tuning algorithm based on neural network estimated tuning parameters is presented. The key performance indicators are proactively monitored….The tuner alters these tuning parameters using the estimated values using a rate change computing algorithm. The preliminary results show that the proposed method is effective in improving the query response time for a variety of workload types."
dba  databases  system-administration  database-administration  design-automation  learning-by-doing  learning-from-data  nudge-targets 
may 2010 by Vaguery
[1005.0437] A Unifying View of Multiple Kernel Learning
"Recent research on multiple kernel learning has lead to a number of approaches for combining kernels in regularized risk minimization. The proposed approaches include different formulations of objectives and varying regularization strategies. In this paper we present a unifying general optimization criterion for multiple kernel learning and show how existing formulations are subsumed as special cases. We also derive the criterion's dual representation, which is suitable for general smooth optimization algorithms. Finally, we evaluate multiple kernel learning in this framework analytically using a Rademacher complexity bound on the generalization error and empirically in a set of experiments."
machine-learning  kernel-methods  mathematics  learning-from-data 
may 2010 by Vaguery
[1005.0967] Detecting Security threats in the Router using Computational Intelligence
"…A version of the method independent of the contrast of the image is considered and is found to be useful for finding the most unusual part (and the most similar part) of the image conditioned on given image. The results can be used to scan large image databases, as for example medical databases.…"
nudge-targets  security  system-administration  DDOS  learning-from-data  adaptive-control  intrusion 
may 2010 by Vaguery
[1005.0527] Detecting the Most Unusual Part of Two and Three-dimensional Digital Images
"…A version of the method independent of the contrast of the image is considered and is found to be useful for finding the most unusual part (and the most similar part) of the image conditioned on given image. The results can be used to scan large image databases, as for example medical databases."
nudge-targets  learning-from-data  diagnostics  image-processing  medical-technology  tomography 
may 2010 by Vaguery
[1005.0957] ECG Feature Extraction Techniques - A Survey Approach
"ECG Feature Extraction plays a significant role in diagnosing most of the cardiac diseases. One cardiac cycle in an ECG signal consists of the P-QRS-T waves. This feature extraction scheme determines the amplitudes and intervals in the ECG signal for subsequent analysis. The amplitudes and intervals value of P-QRS-T segment determines the functioning of heart of every human. Recently, numerous research and techniques have been developed for analyzing the ECG signal. The proposed schemes were mostly based on Fuzzy Logic Methods, Artificial Neural Networks (ANN), Genetic Algorithm (GA), Support Vector Machines (SVM), and other Signal Analysis techniques. All these techniques and algorithms have their advantages and limitations.…
nudge-targets  machine-learning  classification  learning-from-data  diagnostics  medicine 
may 2010 by Vaguery
mperham's bayes_motel at master - GitHub
"BayesMotel is a multi-variate Bayesian classification engine. There are two steps to Bayesian classification:

1. Training: You provide a set of variables along with the proper classification for that set.
2. Runtime: You provide a set of variables and ask for the proper classification according to the training in Step 1.
Commonly this is used for spam detection. You will provide a corpus of emails or other data along with a "Spam/NotSpam" classification. The library will determine which variables affect the classification and use that to judge future data."
Ruby  rubygem  Bayesian  classification  statistics  learning-from-data  machine-learning  algorithms 
april 2010 by Vaguery
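The two steps in the BayesMotel description (train on labelled variable sets, then classify new ones) can be sketched in a few lines of Python. This is only an illustration of categorical naive Bayes under simple assumptions (every example supplies the same variables, Laplace smoothing); the class name `TinyNaiveBayes`, its methods, and the `has_link`/`all_caps` variables are hypothetical and are not the BayesMotel API.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Illustrative categorical naive Bayes (hypothetical; not BayesMotel's API)."""

    def __init__(self):
        self.class_counts = Counter()              # label -> number of training examples
        self.value_counts = defaultdict(Counter)   # (label, variable) -> counts of values
        self.vocab = defaultdict(set)              # variable -> values seen in training

    def train(self, variables, label):
        """Step 1: record one labelled set of variables."""
        self.class_counts[label] += 1
        for name, value in variables.items():
            self.value_counts[(label, name)][value] += 1
            self.vocab[name].add(value)

    def classify(self, variables):
        """Step 2: return the label with the highest log-posterior score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)        # log prior
            for name, value in variables.items():
                seen = self.value_counts[(label, name)][value]
                n_values = len(self.vocab[name] | {value})
                # Laplace (add-one) smoothing so unseen values never give log(0)
                score += math.log((seen + 1) / (count + n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy corpus with made-up variables, following the Spam/NotSpam example.
nb = TinyNaiveBayes()
nb.train({"has_link": True, "all_caps": True}, "Spam")
nb.train({"has_link": True, "all_caps": False}, "Spam")
nb.train({"has_link": False, "all_caps": False}, "NotSpam")
nb.train({"has_link": False, "all_caps": False}, "NotSpam")

spam_guess = nb.classify({"has_link": True, "all_caps": True})
ham_guess = nb.classify({"has_link": False, "all_caps": False})
print(spam_guess, ham_guess)
```

The sketch treats variables as independent given the class, which is the usual naive Bayes simplification; a real library may weight or select variables differently.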
[1004.3980] Hashing Image Patches for Zooming
"In this paper we present a Bayesian image zooming/super-resolution algorithm based on a patch based representation. We work on a patch based model with overlap and employ a Locally Linear Embedding (LLE) based approach as our data fidelity term in the Bayesian inference. The image prior imposes continuity constraints across the overlapping patches."
image-processing  learning-from-data  machine-learning  statistics 
april 2010 by Vaguery
[1001.5210] Supernova Photometric Classification Challenge
"The goals of this challenge are to (1) learn the relative strengths and weaknesses of the different classification algorithms, (2) use the results to improve classification algorithms, and (3) understand what spectroscopically confirmed sub-sets are needed to properly train these algorithms. The challenge is available at www.hep.anl.gov/SNchallenge, and the due date for classifications is May 1, 2010."
classification  learning-from-data  modeling  challenges  astronomy  statistics  nudge  nudge-targets 
march 2010 by Vaguery
[1003.4002] Spectral Classification; Old and Contemporary
"Beginning with a historical account of the spectral classification, its refinement through additional criteria is presented. The line strengths and ratios used in two dimensional classifications of each spectral class are described. A parallel classification scheme for metal-poor stars and the standards used for classification are presented. The extension of spectral classification beyond M to L and T and spectroscopic classification criteria relevant to these classes are described. Contemporary methods of classifications based upon different automated approaches are introduced."
machine-learning  learning-from-data  science2.0  Nudge  clustering  statistics  astronomy  digitization 
march 2010 by Vaguery
[0908.2033] Galaxy Zoo: Reproducing Galaxy Morphologies Via Machine Learning
"We present morphological classifications obtained using machine learning for objects in SDSS DR6 that have been classified by Galaxy Zoo into three classes, namely early types, spirals and point sources/artifacts. An artificial neural network is trained on a subset of objects classified by the human eye and we test whether the machine learning algorithm can reproduce the human classifications for the rest of the sample. We find that the success of the neural network in matching the human classifications depends crucially on the set of input parameters chosen for the machine-learning algorithm. The colours and parameters associated with profile-fitting are reasonable in separating the objects into three classes. However, these results are considerably improved when adding adaptive shape parameters as well as concentration and texture. …"
learning-from-data  machine-learning  galaxy-zoo  crowdsourcing  crowdsourcing-as-training-data  science2.0  Nudge  variable-selection 
march 2010 by Vaguery
Mind the Gap
"In this Challenge, participants are asked to reconstruct, using any combination of available prior and concurrent information, segments of signals that have been removed from multiparameter recordings of patients in intensive care units (ICUs)."
Nudge  learning-from-data  model-discovery  datasets  data  challenge 
march 2010 by Vaguery
Data Marketplace : Find, buy and sell data online
"Data Marketplace makes it easy for people to find, buy and sell data online.

Most data must be aggregated, cleaned, and analyzed to extract useful information. It doesn't make sense that the same person should do all of these things. Data Marketplace connects people who need data with people who are good at collecting, cleaning, and analyzing it.

People request data that they need. Providers upload data to Data Marketplace, provide descriptive metadata, and set a price. Stored metadata is used to help consumers find relevant data through traditional search engines and when browsing the marketplace."
via:arthegall  data  Y-combinator  startup  marketplace  crowdsourcing  learning-from-data  intellectual-property  question-mark-is-left 
march 2010 by Vaguery
Economist's View: "How Have Quantitative Financial Models Been Used and Misused?"
"There are important uses for financial products, even complicated ones, so I don't want to impugn innovation generally, but I also don't want to adopt the position that it was all useful - it clearly wasn't and stronger regulatory oversight is needed. As for the defense of financial models and innovation described above, the statement that innovation generally is the source of economic growth, therefore financial innovation must also be good, isn't much help. Similarly, if saying "models benefit many fields, such as airline safety, and not only financial markets" is the best defense of risk models available, that's telling."
modeling  management-failure  learning-from-data  learning-by-watching  map-is-not-the-territory  financial-crisis  finger-pointing  agility  inagility 
december 2009 by Vaguery
I Now Have Delisted Stock Data! | System Trading with Woodshedder
"I got my data from Norgate Investor Services, (the same folks that provide my end-of-day feed). They only charge a one-time fee for the delisted data, while some of their competitors charge as much as 3x Norgate’s one time fee with the charge recurring annually!
Since adding the delisted database, I have not noted any great differences in the historical results of the systems I work with. I have stated a few times that it is my belief that short-term systems that hold stocks for a few days to a week are not likely to suffer greatly from survivorship bias. So far, this belief is proving to be true."
data  dataset  stocks  history  data-as-a-service  trading  investing  technical-analysis  learning-from-data 
november 2009 by Vaguery
http://arxiv.org/pdf/cs/0406011v1
"Causal state reconstruction has an important advantage over VLMM methods. Each state in a VLMM is represented by a single suffix, and consists of all and only the histories ending in that suffix. For many processes, the causal states contain multiple suffixes. In these cases, multiple “contexts” are needed to represent a single causal state, so VLMMs are generally more complicated than the HMMs we build. The causal state model is the same as the minimal VLMM if and only if every causal state contains a single suffix. This is the case for the process in Fig. 3, where CSSR and VLMM methods will give the same results."
Cosma-R-Shalizi  learning-from-data  models  model-discovery  statistics  complex-systems  time-series  algorithms  nudge 
november 2009 by Vaguery
Is Your Stock Trading System Sick? Take It to the Doctor. | System Trading with Woodshedder
"What I mean by this is that over enough trades, it should not matter that the historical sequence of trades does not match exactly the real-time sequence. Regardless, it is something to keep in mind when comparing historical backtested data to real-time."
trading  financial-engineering  benchmarking  optimization  models  learning-from-data  objectives 
november 2009 by Vaguery
Data Mining Group - PMML 4.0 - General Structure of a PMML Document
"PMML uses XML to represent mining models. The structure of the models is described by an XML Schema. One or more mining models can be contained in a PMML document. A PMML document is an XML document with a root element of type PMML. The general structure of a PMML document is:..."
data-mining  models  learning-from-data  machine-learning  standards  XML  Nudge 
october 2009 by Vaguery
About Tag: Permissions Worth Getting Excited About
"At the moment, any of us who use web applications tend to spend a lot of time and effort populating application databases to make them useful to us. But when we do so, we tend to lose control of our data. They go into a private database schema, and what access we have to that depends entirely on what the application allows us to do. Sometimes there are reasonable ways to get the data back out (some kind of an XML dump perhaps), sometimes not. But always the application is in control. And linking data across applications is, in general, somewhere between hard and impossible.

FluidDB can change all that by leaving the user in control of his or her data, granting the application only such permissions as necessary or desired, and ensuring that the user retains flexibility and control."
FluidDB  Terry-Jones  database  design  software-development  innovation  openness  collaboration  learning-from-data  learning-by-doing 
september 2009 by Vaguery
Rewriting Analyst History
"We document widespread changes to the historical I/B/E/S analyst stock recommendations database. Across seven I/B/E/S downloads, obtained between 2000 and 2007, we find that between 6,580 (1.6%) and 97,582 (21.7%) of matched observations are different from one download to the next. The changes include alterations of recommendations, additions and deletions of records, and removal of analyst names. These changes are nonrandom, clustering by analyst reputation, broker size and status, and recommendation boldness, and affect trading signal classifications and back-tests of three stylized facts: profitability of trading signals, profitability of consensus recommendation changes, and persistence in individual analyst stock-picking ability."
data-access  learning-from-data  analysts  stocks  fudging  what-gets-measured-gets-fudged 
july 2009 by Vaguery
Newton Institute Seminar : Wegman, E, 07/01/2008
"In this presentation, we review some fundamentals of visualization and then proceed to describe methods and combinations of methods useful for visualizing high dimensional data. Some methods include parallel coordinates, smooth interpolations of parallel coordinates, grand tours including wrapping tours, fractal tours, pseudo-grand tours, and pixel tours."
via:cshalizi  visualization  learning-from-data  pattern-discovery  graphics  experimental-design  interactivity 
june 2009 by Vaguery
Katya Vladislavleva - Tilburg University
See in particular Chapter 2, on Data Balancing. This is important stuff for those of us dealing with data-driven models and techniques, especially those not based on analytical closed form first-principles junk.
genetic-programming  modeling  data-analysis  learning-from-data  machine-learning  thesis  techniques  numerical-models 
may 2009 by Vaguery
Linear Classifiers and Loss Functions « Justin Domke’s Weblog
"So, in summary– a drop in classification error on test data from .0941 to .078. That's a 17% drop. (Or a 21% drop, depending upon which rate you use as a base.) This from a method that you can implement in basically zero extra work if you already have a linear classifier. Seems worth a try."
classification  machine-learning  statistics  methodologies  heuristics  learning-from-data 
february 2009 by Vaguery
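The two percentages in the Domke quote come from dividing the same absolute drop by different bases. A quick check, assuming the starting test error is 0.0941 (the value for which both quoted percentages are consistent) and the final error is 0.078:

```python
before = 0.0941   # assumed starting test error rate
after = 0.078     # test error rate after the change

drop = before - after                 # absolute drop in error rate
relative_to_before = drop / before    # base = original error rate
relative_to_after = drop / after      # base = new error rate

print(f"{relative_to_before:.1%} relative to the old rate")
print(f"{relative_to_after:.1%} relative to the new rate")
```

Rounded to whole percentages these give the 17% and 21% figures quoted above.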
All we want are the facts, ma'am
In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
"What are you doing?", asked Minsky.
"I am training a randomly wired neural net to play Tic-Tac-Toe," Sussman replied.
"Why is the net wired randomly?", asked Minsky.
"I do not want it to have any preconceptions of how to play", Sussman said.
Minsky shut his eyes.
"Why do you close your eyes?", Sussman asked his teacher.
"So that the room will be empty."
At that moment, Sussman was enlightened.
via:arthegall  via:cshalizi  science  models  modeling  statistics  learning-from-data  pattern-discovery  hubris  hyperbole  Chris-Anderson  that-Greek-dude-with-the-wings-that-melted 
february 2009 by Vaguery
Pyflix - Trac
"Pyflix is a small package written in Python that provides an easy entry point for getting up and running in the Netflix Prize competition. It combines an efficient storage scheme with an intuitive high-level API that allows contestants to focus on the real problem, the recommendation system algorithm. To get started with Pyflix, keep reading."
via:jhofman  data-mining  prediction  analytics  recommendations  modeling  learning-from-data  competition  programming  library  python  scripting  netflix 
january 2009 by Vaguery

