Vaguery + data-mining   36

[1201.5568] Dynamic trees for streaming and massive data contexts
"Data collection at a massive scale is becoming ubiquitous in a wide variety of settings, from vast offline databases to streaming real-time information. Learning algorithms deployed in such contexts must rely on single-pass inference, where the data history is never revisited. In streaming contexts, learning must also be temporally adaptive to remain up-to-date against unforeseen changes in the data generating mechanism. Although rapidly growing, the online Bayesian inference literature remains challenged by massive data and transient, evolving data streams. Non-parametric modelling techniques can prove particularly ill-suited, as the complexity of the model is allowed to increase with the sample size. In this work, we take steps to overcome these challenges by porting standard streaming techniques, like data discarding and downweighting, into a fully Bayesian framework via the use of informative priors and active learning heuristics. We showcase our methods by augmenting a modern non-parametric modelling framework, dynamic trees, and illustrate its performance on a number of practical examples. The end product is a powerful streaming regression and classification tool, whose performance compares favourably to the state-of-the-art."
data-analysis  learning-from-data  algorithms  drinking-from-the-firehose  nudge  data-mining 
january 2012 by Vaguery
Datameer snags $9.25M more to analyze massive amounts of data | VentureBeat
"Datameer, a company that allows users to analyze massive amounts of data without technical know-how, today announced a second round of funding for $9.25 million. The money will be used to hire additional employees for its engineering, sales, and marketing teams."
data-analysis  data-mining  startups  funding  bubblicious 
june 2011 by Vaguery
Growing need for data heads
"I've said it before, but if digging into data is your idea of fun, there's a whole mess of excitement and adventure headed your way. There are lots of opportunities already out there in marketing, journalism, tech, the Web, government, and pretty much everywhere you look. And more importantly, there are lots of opportunities that you can make for yourself. This is a great time for data heads."
data-science  data-mining  statistics  jobs  advice 
may 2011 by Vaguery
[1007.5510] An algorithm for the principal component analysis of large data sets
"Recently popularized randomized methods for principal component analysis (PCA) efficiently and reliably produce nearly optimal accuracy - even on parallel processors - unlike the classical (deterministic) alternatives. We adapt one of these randomized methods for use with data sets that are too large to be stored in random-access memory (RAM). (The traditional terminology is that our procedure works efficiently "out-of-core.") We illustrate the performance of the algorithm via several numerical examples. For example, we report on the PCA of a data set stored on disk that is so large that less than a hundredth of it can fit in our computer's RAM."
algorithms  big-data-will-lead-to-big-inference  statistics  data-mining  exploratory-data-analysis 
august 2010 by Vaguery
[1006.4968] Validation of credit default probabilities via multiple testing procedures
"We apply multiple testing procedures to the validation of estimated default probabilities in credit rating systems. The goal is to identify rating classes for which the probability of default is estimated inaccurately, while still maintaining a predefined level of committing type I errors as measured by the familywise error rate (FWER) and the false discovery rate (FDR). For FWER, we also consider procedures that take possible discreteness of the data resp. test statistics into account. The performance of these methods is illustrated in a simulation setting and for empirical default data."
finance  prediction  data-mining  models  statistics  machine-learning  nudge-targets 
june 2010 by Vaguery
[1006.5273] Linear Detrending Subsequence Matching in Time-Series Databases
"Each time-series has its own linear trend, the directionality of a timeseries, and removing the linear trend is crucial to get the more intuitive matching results. Supporting the linear detrending in subsequence matching is a challenging problem due to a huge number of possible subsequences. In this paper we define this problem the linear detrending subsequence matching and propose its efficient index-based solution. To this end, we first present a notion of LD-windows (LD means linear detrending), which is obtained as follows: we eliminate the linear trend from a subsequence rather than each window itself and obtain LD-windows by dividing the subsequence into windows. Using the LD-windows we then present a lower bounding theorem for the index-based matching solution and formally prove its correctness.…"
time-series  data-mining  data-analysis  prediction  statistics  nudge-targets 
june 2010 by Vaguery
CASS
"In the social sciences, it is useful to understand the relative similarities of concepts that are embedded in a particular text (from a particular group or a particular person). For example, in trying to estimate conservative bias in FoxNews, one might estimate its tendency to associate conservative concepts (conservative, republican) and good concepts (good, positive, etc.), compared to conservative and bad concepts. The output would indicate conservative favoritism. This comparison could be further refined by taking into account important "baseline" information about the valences associated with liberal, namely liberal and good in comparison to liberal and bad.…"
text-mining  natural-language-processing  data-mining  machine-learning  Ruby  library 
june 2010 by Vaguery
A Peek Into the Future: HFT and Financial News -- Seeking Alpha
"A still more realistic and subtle, but much more troublesome scenario: Financial Undetectable Journalistic Engineering (FUJE). Financial news journalists could word the reports differently and send very different signals to the robot army. Here're two actual news headlines re. the May NFP number (incidentally, both are from the same outlet, same day, different reporter -- just a random google search):

US adds 431,000 jobs in May, unemployment down to 9.7 pct
vs.

Despite Adding 431K Jobs, May Non-Farm Payroll Figures Disappoint
The first is factual; the second contains more in-depth analysis. It takes an experienced human to parse and reconcile the two. You can see how robot readers may assign opposite signs to each."
data-mining  high-frequency-trading  trading  news  learning-from-data  boy-am-I-glad-we-folded-the-startup 
june 2010 by Vaguery
What is data science? - O'Reilly Radar
"We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?

In this post, I examine the many sides of data science -- the technologies, the companies and the unique skill sets."
data-analysis  data-mining  learning-from-data  statistics  futurism  drinking-from-the-firehose  nudge  via:tsuomela 
june 2010 by Vaguery
Data Mining Group - PMML 4.0 - General Structure of a PMML Document
"PMML uses XML to represent mining models. The structure of the models is described by an XML Schema. One or more mining models can be contained in a PMML document. A PMML document is an XML document with a root element of type PMML. The general structure of a PMML document is:..."
data-mining  models  learning-from-data  machine-learning  standards  XML  Nudge 
october 2009 by Vaguery
Model Selection
"In statistics and machine learning, "model selection" is the problem of picking among different mathematical models which all purport to describe the same data set. This notebook will not (for now) give advice on it; as usual, it's more of a place to organize my thoughts and references..."
Cosma-R-Shalizi  Nudge  reference  statistics  data-mining  theory 
september 2009 by Vaguery
"Statistical Theory and Methods for Complex, High-Dimensional Data"
To read in context of current practices of Pareto-GP model discovery: are there any cultural similarities at all between these people and the GP practitioners' approach?
via:cshalizi  data-mining  models  model-discovery  heuristics  statistics  fat-data 
june 2009 by Vaguery
About Us | Polymeme
"Polymeme helps you navigate the new networked public sphere and keep your fingers on the intellectual pulse of the blogosphere.

Polymeme helps you discover intelligent content that lies beyond the usual echo chambers of tech news, celebrity gossip or American politics.

Our site uses a unique buzz-tracking approach to identify what's currently hot in 20 areas, ranging from economics to evolution, and present it to the reader along with all sources that are currently talking about it. Thus, you can track how ideas – or memes – propagate through this new emerging networked public sphere. We would consider our mission a success if we expose you to the maximum number of new ideas on every 100 news items you read!"
social-software  social-networks  marketing  madness-of-crowds  blogging  media  data-mining  trends  aggregation 
march 2009 by Vaguery
Pyflix - Trac
"Pyflix is a small package written in Python that provides an easy entry point for getting up and running in the Netflix Prize competition. It combines an efficient storage scheme with an intuitive high-level API that allows contestants to focus on the real problem, the recommendation system algorithm. To get started with Pyflix, keep reading."
via:jhofman  data-mining  prediction  analytics  recommendations  modeling  learning-from-data  competition  programming  library  python  scripting  netflix 
january 2009 by Vaguery
AltSearchEngines » Blog Archive » How to Search for Influencers with Datanetis
Be braced:

"For someone that has been working building software for the marketing automation industry over 8 years now and is familiar with multiple solutions for finding the right prospect out of many, it was an eye opener. I’m evidencing the progression from mass email campaigns through marketing to target individuals with a matching/relevant offers (data mining, behavioral pattern, collaborate filtering, recommendation engines) to finding customers that can market for you - agents."
social-networks  marketing  influence  advertising  data-mining  networks  search-engines 
december 2008 by Vaguery
Where to Look for Ideas in This Market - Seeking Alpha
"Last year, more than 16,000 companies filed 10-Ks or 10KSBs with the SEC. Assuming they come in evenly (they don't, but we'll say this for simplicity's sake), that would be more than 4,000 annual reports a quarter, or roughly 44 a day, each and every day of the year. Forget holidays, vacations, or your kids' birthdays — you've got annual reports to read!"
Nudge  sentiment  prediction  mining  data-mining  datasets  genetic-programming  training  validation 
october 2008 by Vaguery
Texture Synthesis Links
Various potentially useful resources for texture synthesis and image analysis applications of genetic programming.
resources  library  machine-learning  datasets  data-analysis  data-mining  test-cases 
march 2008 by Vaguery
(theinfo)
"This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange tips and tricks, dev
via:arthegall  algorithms  analytics  collaboration  collection  data  data-analysis  data-mining  hacking  open  research  tools 
january 2008 by Vaguery
FOODPAIRING
Some kind of network of interchangeable and complementary food ingredients. Somewhat questionably vague.
food  flavor  cookery  networks  data-mining  visualization  recommendations  recipes 
december 2007 by Vaguery
PostModel
I wonder whether it will also interface easily with Mathematica, Matlab, and other packages? Seems like....
PostGre  database  software  development  data-mining  models  extension 
february 2007 by Vaguery
