Vaguery + data-mining 36
[1201.5568] Dynamic trees for streaming and massive data contexts
january 2012 by Vaguery
"Data collection at a massive scale is becoming ubiquitous in a wide variety of settings, from vast offline databases to streaming real-time information. Learning algorithms deployed in such contexts must rely on single-pass inference, where the data history is never revisited. In streaming contexts, learning must also be temporally adaptive to remain up-to-date against unforeseen changes in the data generating mechanism. Although rapidly growing, the online Bayesian inference literature remains challenged by massive data and transient, evolving data streams. Non-parametric modelling techniques can prove particularly ill-suited, as the complexity of the model is allowed to increase with the sample size. In this work, we take steps to overcome these challenges by porting standard streaming techniques, like data discarding and downweighting, into a fully Bayesian framework via the use of informative priors and active learning heuristics. We showcase our methods by augmenting a modern non-parametric modelling framework, dynamic trees, and illustrate its performance on a number of practical examples. The end product is a powerful streaming regression and classification tool, whose performance compares favourably to the state-of-the-art."
data-analysis
learning-from-data
algorithms
drinking-from-the-firehose
nudge
data-mining
january 2012 by Vaguery
Datameer snags $9.25M more to analyze massive amounts of data | VentureBeat
june 2011 by Vaguery
"Datameer, a company that allows users to analyze massive amounts of data without technical know-how, today announced a second round of funding for $9.25 million. The money will be used to hire additional employees for its engineering, sales, and marketing teams."
data-analysis
data-mining
startups
funding
bubblicious
june 2011 by Vaguery
Growing need for data heads
may 2011 by Vaguery
"I've said it before, but if digging into data is your idea of fun, there's a whole mess of excitement and adventure headed your way. There are lots of opportunities already out there in marketing, journalism, tech, the Web, government, and pretty much everywhere you look. And more importantly, there are lots of opportunities that you can make for yourself. This is a great time for data heads."
data-science
data-mining
statistics
jobs
advice
may 2011 by Vaguery
[1007.5510] An algorithm for the principal component analysis of large data sets
august 2010 by Vaguery
"Recently popularized randomized methods for principal component analysis (PCA) efficiently and reliably produce nearly optimal accuracy - even on parallel processors - unlike the classical (deterministic) alternatives. We adapt one of these randomized methods for use with data sets that are too large to be stored in random-access memory (RAM). (The traditional terminology is that our procedure works efficiently "out-of-core.") We illustrate the performance of the algorithm via several numerical examples. For example, we report on the PCA of a data set stored on disk that is so large that less than a hundredth of it can fit in our computer's RAM."
algorithms
big-data-will-lead-to-big-inference
statistics
data-mining
exploratory-data-analysis
august 2010 by Vaguery
[1006.4968] Validation of credit default probabilities via multiple testing procedures
june 2010 by Vaguery
"We apply multiple testing procedures to the validation of estimated default probabilities in credit rating systems. The goal is to identify rating classes for which the probability of default is estimated inaccurately, while still maintaining a predefined level of committing type I errors as measured by the familywise error rate (FWER) and the false discovery rate (FDR). For FWER, we also consider procedures that take possible discreteness of the data resp. test statistics into account. The performance of these methods is illustrated in a simulation setting and for empirical default data."
finance
prediction
data-mining
models
statistics
machine-learning
nudge-targets
june 2010 by Vaguery
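Note to self, a minimal sketch of the two error-rate controls the abstract names: Holm's step-down procedure for FWER and Benjamini-Hochberg for FDR. These are the textbook procedures, not necessarily the exact variants the paper evaluates (it also covers discreteness-aware ones). The p-values are made up, one per hypothetical rating class.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down procedure (controls FWER). Returns one bool per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls FDR). Returns one bool per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Track the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values, one per rating class (illustrative only).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
holm_flags = holm(p)
bh_flags = benjamini_hochberg(p)
```

On these made-up numbers Holm flags only the first class, while Benjamini-Hochberg flags the first two, which is the usual trade: FDR control is less conservative than FWER control.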
[1006.5273] Linear Detrending Subsequence Matching in Time-Series Databases
june 2010 by Vaguery
"Each time-series has its own linear trend, the directionality of a timeseries, and removing the linear trend is crucial to get the more intuitive matching results. Supporting the linear detrending in subsequence matching is a challenging problem due to a huge number of possible subsequences. In this paper we define this problem the linear detrending subsequence matching and propose its efficient index-based solution. To this end, we first present a notion of LD-windows (LD means linear detrending), which is obtained as follows: we eliminate the linear trend from a subsequence rather than each window itself and obtain LD-windows by dividing the subsequence into windows. Using the LD-windows we then present a lower bounding theorem for the index-based matching solution and formally prove its correctness.…"
time-series
data-mining
data-analysis
prediction
statistics
nudge-targets
june 2010 by Vaguery
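Note to self, a minimal sketch of what "removing the linear trend" before matching means: fit a least-squares line to each subsequence, subtract it, then compare the residuals. This is the naive per-pair version, not the paper's LD-window index or its lower bound; the function names are my own.

```python
def linear_detrend(values):
    """Subtract the least-squares line (fitted against positions 0..n-1) from values."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(values)]


def detrended_distance(a, b):
    """Euclidean distance between two equal-length sequences after detrending each."""
    da, db = linear_detrend(a), linear_detrend(b)
    return sum((x - y) ** 2 for x, y in zip(da, db)) ** 0.5


# b is a shifted by 10 with an extra slope of 1 per step, so the shapes match.
a = [0, 2, 1, 3, 2]
b = [10, 13, 13, 16, 16]
raw_distance = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
ld_distance = detrended_distance(a, b)
```

The raw distance between `a` and `b` is large, but the detrended distance is essentially zero, since the two differ only by an offset and a linear trend.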
CASS
june 2010 by Vaguery
"In the social sciences, it is useful to understand the relative similarities of concepts that are embedded in a particular text (from a particular group or a particular person). For example, in trying to estimate conservative bias in FoxNews, one might estimate its tendency to associate conservative concepts (conservative, republican) and good concepts (good, positive, etc.), compared to conservative and bad concepts. The output would indicate conservative favoritism. This comparison could be further refined by taking into account important "baseline" information about the valences associated with liberal, namely liberal and good in comparison to liberal and bad.…"
text-mining
natural-language-processing
data-mining
machine-learning
Ruby
library
june 2010 by Vaguery
[1006.4929] Detecting epistasis via Markov bases
june 2010 by Vaguery
Specifically: "Genome-wide association study of hair length in dogs"
nudge-targets
epistasis
bioinformatics
genomics
data-mining
firehose-drinking
phenotype-genotype-stuff
june 2010 by Vaguery
A Peek Into the Future: HFT and Financial News -- Seeking Alpha
june 2010 by Vaguery
"A still more realistic and subtle, but much more troublesome scenario: Financial Undetectable Journalistic Engineering (FUJE). Financial news journalists could word the reports differently and send very different signals to the robot army. Here're two actual news headlines re. the May NFP number (incidentally, both are from the same outlet, same day, different reporter -- just a random google search):
US adds 431,000 jobs in May, unemployment down to 9.7 pct
vs.
Despite Adding 431K Jobs, May Non-Farm Payroll Figures Disappoint
The first is factual; the second contains more in-depth analysis. It takes an experienced human to parse and reconcile the two. You can see how robot readers may assign opposite signs to each."
data-mining
high-frequency-trading
trading
news
learning-from-data
boy-am-I-glad-we-folded-the-startup
june 2010 by Vaguery
What is data science? - O'Reilly Radar
june 2010 by Vaguery
"We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?
In this post, I examine the many sides of data science -- the technologies, the companies and the unique skill sets."
data-analysis
data-mining
learning-from-data
statistics
futurism
drinking-from-the-firehose
nudge
via:tsuomela
june 2010 by Vaguery
Data Mining Group - PMML 4.0 - General Structure of a PMML Document
october 2009 by Vaguery
"PMML uses XML to represent mining models. The structure of the models is described by an XML Schema. One or more mining models can be contained in a PMML document. A PMML document is an XML document with a root element of type PMML. The general structure of a PMML document is:..."
data-mining
models
learning-from-data
machine-learning
standards
XML
Nudge
october 2009 by Vaguery
Model Selection
september 2009 by Vaguery
"In statistics and machine learning, "model selection" is the problem of picking among different mathematical models which all purport to describe the same data set. This notebook will not (for now) give advice on it; as usual, it's more of a place to organize my thoughts and references..."
Cosma-R-Shalizi
Nudge
reference
statistics
data-mining
theory
september 2009 by Vaguery
"Statistical Theory and Methods for Complex, High-Dimensional Data"
june 2009 by Vaguery
To read in context of current practices of Pareto-GP model discovery: are there any cultural similarities <i>at all</i> between these people and the GP practitioners' approach?
via:cshalizi
data-mining
models
model-discovery
heuristics
statistics
fat-data
june 2009 by Vaguery
About Us | Polymeme
march 2009 by Vaguery
"Polymeme helps you navigate the new networked public sphere and keep your fingers on the intellectual pulse of the blogosphere.
Polymeme helps you discover intelligent content that lies beyond the usual echo chambers of tech news, celebrity gossip or American politics.
Our site uses a unique buzz-tracking approach to identify what's currently hot in 20 areas, ranging from economics to evolution, and present it to the reader along with all sources that are currently talking about it. Thus, you can track how ideas – or memes – propagate through this new emerging networked public sphere. We would consider our mission a success if we expose you to the maximum number of new ideas on every 100 news items you read!"
social-software
social-networks
marketing
madness-of-crowds
blogging
media
data-mining
trends
aggregation
march 2009 by Vaguery
Pyflix - Trac
january 2009 by Vaguery
"Pyflix is a small package written in Python that provides an easy entry point for getting up and running in the Netflix Prize competition. It combines an efficient storage scheme with an intuitive high-level API that allows contestants to focus on the real problem, the recommendation system algorithm. To get started with Pyflix, keep reading."
via:jhofman
data-mining
prediction
analytics
recommendations
modeling
learning-from-data
competition
programming
library
python
scripting
netflix
january 2009 by Vaguery
Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy
january 2009 by Vaguery
reproduce this using Pareto-GP?
data-mining
prediction
modeling
variable-selection
regression
analytics
Nudge
january 2009 by Vaguery
AltSearchEngines » Blog Archive » How to Search for Influencers with Datanetis
december 2008 by Vaguery
Be braced:
"For someone that has been working building software for the marketing automation industry over 8 years now and is familiar with multiple solutions for finding the right prospect out of many, it was an eye opener. I’m evidencing the progression from mass email campaigns through marketing to target individuals with a matching/relevant offers (data mining, behavioral pattern, collaborate filtering, recommendation engines) to finding customers that can market for you - agents."
social-networks
marketing
influence
advertising
data-mining
networks
search-engines
december 2008 by Vaguery
Where to Look for Ideas in This Market - Seeking Alpha
october 2008 by Vaguery
"Last year, more than 16,000 companies filed 10-Ks or 10KSBs with the SEC. Assuming they come in evenly (they don't, but we'll say this for simplicity's sake), that would be more than 4,000 annual reports a quarter, or roughly 44 a day, each and every day of the year. Forget holidays, vacations, or your kids' birthdays — you've got annual reports to read!"
Nudge
sentiment
prediction
mining
data-mining
datasets
genetic-programming
training
validation
october 2008 by Vaguery
Texture Synthesis Links
march 2008 by Vaguery
Various potentially useful resources for texture synthesis and image analysis applications of genetic programming.
resources
library
machine-learning
datasets
data-analysis
data-mining
test-cases
march 2008 by Vaguery
(theinfo)
january 2008 by Vaguery
"This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange tips and tricks, dev
via:arthegall
algorithms
analytics
collaboration
collection
data
data-analysis
data-mining
hacking
open
research
tools
january 2008 by Vaguery
FOODPAIRING
december 2007 by Vaguery
Some kind of network of interchangeable and complementary food ingredients. Somewhat questionably vague.
food
flavor
cookery
networks
data-mining
visualization
recommendations
recipes
december 2007 by Vaguery
PostModel
february 2007 by Vaguery
I wonder whether it will also interface easily with Mathematica, Matlab, and other packages? Seems like....
PostGre
database
software
development
data-mining
models
extension
february 2007 by Vaguery