Vaguery + data-analysis 68
Visualization series: Insight from Cleveland and Tufte on plotting numeric data by groups | Solomon Messing
11 weeks ago by Vaguery
"A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below). Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points. Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive."
visualization
data-analysis
communication
graphic-design
argumentation
statistics
ggplot2
11 weeks ago by Vaguery
[1202.0077] An Interacting Particle Model for Clustering Euclidean Datasets
february 2012 by Vaguery
"In this paper we propose a method based on interacting particle physics, devised for clustering Euclidean datasets without initial constraints or conditions. We model any dataset as an interacting particle system, whose elements correspond to particles that interact through a simplified version of Lennard-Jones potentials. In so doing, mutual attractive interactions allow to identify groups of proximal particles. The main outcome of this modeling task is an adjacency matrix, taken as input by a community detection algorithm aimed to identify different partitions. The underlying conjecture is that, using a multiresolution analysis, the adopted model allows to find the right number of clusters for any given dataset. Experimental results, performed in comparison with a classical clustering algorithm, confirm this assumption."
clustering
data-analysis
algorithms
nudge-targets
distributed-processing
february 2012 by Vaguery
[1201.5568] Dynamic trees for streaming and massive data contexts
january 2012 by Vaguery
"Data collection at a massive scale is becoming ubiquitous in a wide variety of settings, from vast offline databases to streaming real-time information. Learning algorithms deployed in such contexts must rely on single-pass inference, where the data history is never revisited. In streaming contexts, learning must also be temporally adaptive to remain up-to-date against unforeseen changes in the data generating mechanism. Although rapidly growing, the online Bayesian inference literature remains challenged by massive data and transient, evolving data streams. Non-parametric modelling techniques can prove particularly ill-suited, as the complexity of the model is allowed to increase with the sample size. In this work, we take steps to overcome these challenges by porting standard streaming techniques, like data discarding and downweighting, into a fully Bayesian framework via the use of informative priors and active learning heuristics. We showcase our methods by augmenting a modern non-parametric modelling framework, dynamic trees, and illustrate its performance on a number of practical examples. The end product is a powerful streaming regression and classification tool, whose performance compares favourably to the state-of-the-art."
data-analysis
learning-from-data
algorithms
drinking-from-the-firehose
nudge
data-mining
january 2012 by Vaguery
[1112.2316] Complexity-entropy causality plane: a useful approach for distinguishing songs
january 2012 by Vaguery
"Nowadays we are often faced with huge databases resulting from the rapid growth of data storage technologies. This is particularly true when dealing with music databases. In this context, it is essential to have techniques and tools able to discriminate properties from these massive sets. In this work, we report on a statistical analysis of more than ten thousand songs aiming to obtain a complexity hierarchy. Our approach is based on the estimation of the permutation entropy combined with an intensive complexity measure, building up the complexity-entropy causality plane. The results obtained indicate that this representation space is very promising to discriminate songs as well as to allow a relative quantitative comparison among songs. Additionally, we believe that the here-reported method may be applied in practical situations since it is simple, robust and has a fast numerical implementation."
signal-processing
classification
data-analysis
clustering
representation
music
nudge-targets
january 2012 by Vaguery
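The permutation entropy named in the abstract above can be sketched in a few lines. This is a minimal illustration of the usual ordinal-pattern estimate, not the authors' code: the function name `permutation_entropy`, the default `order`, and the way ties are broken (by position) are assumptions, and the complexity measure that forms the second axis of the plane is left out.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3):
    """Normalized permutation entropy in [0, 1] (hypothetical helper).

    Each window of `order` consecutive values is reduced to its ordinal
    pattern (the argsort of the window). The Shannon entropy of the pattern
    frequencies is divided by log(order!), so 1.0 means every pattern is
    equally likely and 0.0 means only one pattern ever occurs.
    """
    if order < 2 or len(series) < order:
        raise ValueError("need order >= 2 and at least `order` samples")
    patterns = Counter(
        tuple(sorted(range(order), key=lambda k: series[i + k]))
        for i in range(len(series) - order + 1)
    )
    total = sum(patterns.values())
    entropy = -sum((c / total) * math.log(c / total) for c in patterns.values())
    return entropy / math.log(math.factorial(order))
```

A strictly increasing series produces a single pattern and therefore an entropy of zero; noisier signals move toward one.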
Classifying Heart Sounds Challenge
november 2011 by Vaguery
"According to the World Health Organisation, cardiovascular diseases (CVDs) are the number one cause of death globally: more people die annually from CVDs than from any other cause. An estimated 17.1 million people died from CVDs in 2004, representing 29% of all global deaths. Of these deaths, an estimated 7.2 million were due to coronary heart disease. Any method which can help to detect signs of heart disease could therefore have a significant impact on world health. This challenge is to produce methods to do exactly that. Specifically, we are interested in creating the first level of screening of cardiac pathologies both in a Hospital environment by a doctor (using a digital stethoscope) and at home by the patient (using a mobile device).
The problem is of particular interest to machine learning researchers as it involves classification of audio sample data, where distinguishing between classes of interest is non-trivial. Data is gathered in real-world situations and frequently contains background noise of every conceivable type. The differences between heart sounds corresponding to different heart symptoms can also be extremely subtle and challenging to separate. Success in classifying this form of data requires extremely robust classifiers. Despite its medical significance, to date this is a relatively unexplored application for machine learning."
machine-learning
competition
nudge-targets
classification
segmentation
data-analysis
supervised-learning
november 2011 by Vaguery
[1105.4953] A fast nearest neighbor search algorithm based on vector quantization
october 2011 by Vaguery
"In this article, we propose a new fast nearest neighbor search algorithm, based on vector quantization. Like many other branch and bound search algorithms [1,10], a preprocessing recursively partitions the data set into disjointed subsets until the number of points in each part is small enough. In doing so, a search-tree data structure is built. This preliminary recursive data-set partition is based on the vector quantization of the empirical distribution of the initial data-set. Unlike previously cited methods, this kind of partitions does not a priori allow to eliminate several brother nodes in the search tree with a single test. To overcome this difficulty, we propose an algorithm to reduce the number of tested brother nodes to a minimal list that we call "friend Voronoi cells". The complete description of the method requires a deeper insight into the properties of Delaunay triangulations and Voronoi diagrams"
algorithms
search-algorithms
data-analysis
nudge-targets
october 2011 by Vaguery
Datameer snags $9.25M more to analyze massive amounts of data | VentureBeat
june 2011 by Vaguery
"Datameer, a company that allows users to analyze massive amounts of data without technical know-how, today announced a second round of funding for $9.25 million. The money will be used to hire additional employees for its engineering, sales, and marketing teams."
data-analysis
data-mining
startups
funding
bubblicious
june 2011 by Vaguery
"Big Memory" Company Terracotta Snapped Up by Europe's Fourth Largest Software Company
may 2011 by Vaguery
"In-memory is a hot topic right now, thanks in part to SAP pushing its in-memory analytics platform HANA at Sapphire last week. HANA, however, is not a direct competitor to BigMemory. According to RedMonk co-founder James Governor, competitors include Oracle Coherence, IBM eXtreme Scale, Hazelcast and Gigaspaces.
"Indeed distributed cache is well known enough to be seen as a 'competitor' to NoSQL approaches," Governor wrote. "Both take load off the database - less database work generally means greater scalability""
software-architecture
distributed-processing
data-analysis
database
open-source
may 2011 by Vaguery
[0807.1271] Semiparametric curve alignment and shift density estimation for biological data
august 2010 by Vaguery
"Assume that we observe a large number of curves, all of them with identical, although unknown, shape, but with a different random shift. The objective is to estimate the individual time shifts and their distribution. Such an objective appears in several biological applications like neuroscience or ECG signal processing, in which the estimation of the distribution of the elapsed time between repetitive pulses with a possibly low signal-noise ratio, and without a knowledge of the pulse shape is of interest. We suggest an M-estimator leading to a three-stage algorithm: we split our data set in blocks, on which the estimation of the shifts is done by minimizing a cost criterion based on a functional of the periodogram; the estimated shifts are then plugged into a standard density estimator. We show that under mild regularity assumptions the density estimate converges weakly to the true shift distribution. The theory is applied both to simulations and to alignment of real ECG signals.…"
data-analysis
statistics
algorithms
heuristics
exploratory-data-analysis
nudge
optimization
classification
time-series
august 2010 by Vaguery
How did Weather Data Get Opened? - A Healthy Information Diet - InfoVegan.com
august 2010 by Vaguery
"Weather data didn’t come to be because of an Open Government Directive. It wasn’t created because of a White House mandate. Government did not release the data and then enterprising people built companies on top of it. It’s more accurate to make the argument that we have a national weather service because of one man’s deep desire to keep his job and to get promoted to colonel in the Army. It could be a vast network of lobbyists to help that man get promoted, or the vast network of lobbyists from shipping companies trying to get access to data already being created. Or it could be that it was just pretty obvious that access to weather data would save lives."
weather
open-access
data-analysis
big-data-will-lead-to-big-inference
public-policy
marketing
august 2010 by Vaguery
[1008.1758] Stochastic Data Clustering
august 2010 by Vaguery
"In 1961 Herbert Simon and Albert Ando published the theory behind the long-term behavior of a dynamical system that can be described by a nearly completely decomposable matrix. Over the past fifty years this theory has been used in a variety of contexts, including queueing theory, computer performance, and ecology. In all these applications, the structure of the system is known and the point of interest is the various states the system passes through on its way to some long-term equilibrium. This paper looks at this problem from the other direction. That is, we develop a technique for using the evolution of the system to tell us about its initial structure, and we use this technique to develop a new algorithm for data clustering."
clustering
data-analysis
exploratory-data-analysis
statistics
algorithms
august 2010 by Vaguery
Nanex - Market Crop Circle Of The Day
august 2010 by Vaguery
"As we continue to monitor the markets for evidence of Quote Stuffing and Strange Sequences (Crop Circles), we find that there are dozens if not hundreds of examples to choose from on any given day. As such, this page will be updated often with charts demonstrating this activity.
The common theme with the charts shown on this page is they are obviously all generated in code and are algorithmic. Some demonstrate bizarre price or size cycling, some demonstrate large burst of quotes in extremely short time frames and some will demonstrate both. In most cases these sequences are from a single exchange with no other exchange quoting in the same time frame."
machine-learning
trading
financial-engineering
skynet
data-analysis
emergent-design
technical-analysis
behavioral-finance
august 2010 by Vaguery
Flash Crash Analysis - May 6'th 2010 - Part 4 - Nanex
august 2010 by Vaguery
"While analyzing HFT (High Frequency Trading) quote counts, we were shocked to find cases where one exchange was sending an extremely high number of quotes for one stock in a single second: as high as 5,000 quotes in 1 second! During May 6, there were hundreds of times that a single stock had over 1,000 quotes from one exchange in a single second. Even more disturbing, there doesn't seem to be any economic justification for this. In many of the cases, the bid/offer is well outside the National Best Bid/Offer (NBBO). We decided to analyze a handful of these cases in detail and graphed the sequential bid/offers to better understand them. What we discovered was a manipulative device with destabilizing effect."
trading
financial-systems
design-automation
complex-systems
emergent-design
engineering
data-analysis
skynet
august 2010 by Vaguery
[1006.4531] Generalised network clustering and its dynamical implications
august 2010 by Vaguery
"A parameterisation of generalised network clustering, in the form of four-motif prevalences, is presented. This involves three real parameters that are conditional on one- two- and three-motif prevalences. Interpretations of these real parameters are presented that motivate a set of rewiring schemes to create appropriately clustered networks. Finally, the dynamical implications of higher order structure, as parameterised, for a contact process are considered."
clustering
network-theory
complexology
nudge-targets
algorithms
data-analysis
comparison
august 2010 by Vaguery
[1005.5141] Constructing Positive Definite Elastic Kernels with Application to Time Series Classification
august 2010 by Vaguery
"This paper proposes some extensions to the work on kernels dedicated to string alignment (biological sequence alignment) based on the summing up of scores obtained by local alignments with gaps. The extensions we propose allow to construct, from classical time-warp distances, what we called summative time-warp kernels that are positive definite if some simple sufficient conditions are satisfied. Furthermore, from the same formalism, we derive a time-warp inner product that extends the usual euclidean inner product, providing the capability to handle discrete sequences or time series of variable lengths in an Hilbert space. The classification experiment we conducted, using either first near neighbor classifier or Support Vector Machine classifier leads to conclude that the positive definite elastic kernels we propose outperform the distance substituting kernels for the classical elastic distances we tested.…"
time-series
data-analysis
nudge-targets
classification
machine-learning
algorithms
august 2010 by Vaguery
[1007.1075] Clustering Stability: An Overview
august 2010 by Vaguery
"A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications."
statistics
data-analysis
clustering
nonparametric-statistics
exploratory-data-analysis
heuristics
august 2010 by Vaguery
[0903.5066] Modified-CS: Modifying Compressive Sensing for Problems with Partially Known Support
july 2010 by Vaguery
"We study the problem of reconstructing a sparse signal from a limited number of its linear projections when a part of its support is known, although the known part may contain some errors. The 'known' part of the support, denoted T, may be available from prior knowledge. Alternatively, in a problem of recursively reconstructing time sequences of sparse spatial signals, one may use the support estimate from the previous time instant as the 'known' part. The idea of our proposed solution (modified-CS) is to solve a convex relaxation of the following problem: find the signal that satisfies the data constraint and is sparsest outside of T.…"
compressed-sensing
algorithms
machine-learning
statistics
signal-processing
nudge-targets
data-analysis
july 2010 by Vaguery
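The problem stated in words at the end of the abstract above can be written out as follows. This is a sketch under assumptions: the abstract names only the set T, so the symbols y (measurements), A (measurement matrix) and β (candidate signal) are introduced here for illustration, and the convex relaxation is taken to be the usual replacement of the sparsity count by an ℓ1 norm.

```latex
% Relaxed problem: minimize the l1 norm of the signal outside the known support T
\min_{\beta} \; \lVert \beta_{T^{c}} \rVert_{1}
\quad \text{subject to} \quad y = A\beta
```

Here T^c is the complement of T, so entries inside the known support are not penalized at all.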
[1007.4748] Detecting influenza outbreaks by analyzing Twitter messages
july 2010 by Vaguery
"We analyze over 500 million Twitter messages from an eight month period and find that tracking a small number of flu-related keywords allows us to forecast future influenza rates with high accuracy, obtaining a 95% correlation with national health statistics. We then analyze the robustness of this approach to spurious keyword matches, and we propose a document classification component to filter these misleading messages. We find that this document classifier can reduce error rates by over half in simulated false alarm experiments, though more research is needed to develop methods that are robust in cases of extremely high noise."
epidemiology
twitter
social-media
data-analysis
public-health
big-data-will-lead-to-big-inference
july 2010 by Vaguery
[1007.3799] Adapting to the Shifting Intent of Search Queries
july 2010 by Vaguery
"Search engines today present results that are often oblivious to abrupt shifts in intent. For example, the query `independence day' usually refers to a US holiday, but the intent of this query abruptly changed during the release of a major film by that name. … This paper shows that the signals a search engine receives can be used to both determine that a shift in intent has happened, as well as find a result that is now more relevant. We present a meta-algorithm that marries a classifier with a bandit algorithm to achieve regret that depends logarithmically on the number of query impressions, under certain assumptions. We provide strong evidence that this regret is close to the best achievable. Finally, via a series of experiments, we demonstrate that our algorithm outperforms prior approaches, particularly as the amount of intent-shifting traffic increases."
search-engines
search-algorithms
machine-learning
social-dynamics
algorithms
nudge-targets
intelligence-gathering
data-analysis
july 2010 by Vaguery
[1007.4191] Fast Moment Estimation in Data Streams in Optimal Space
july 2010 by Vaguery
"We give a space-optimal algorithm with update time O(log^2(1/eps)loglog(1/eps)) for (1+eps)-approximating the pth frequency moment, 0 < p < 2, of a length-n vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time Omega(1/eps^2)."
nudge-targets
algorithms
data-analysis
online-learning
machine-learning
computational-complexity
statistics
july 2010 by Vaguery
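For reference, the quantity being approximated in the abstract above is the pth frequency moment, the sum of |x_i|^p over the coordinates of the vector. The sketch below computes it exactly while the vector is updated one coordinate at a time. It is a baseline that stores every nonzero coordinate, not the space-optimal algorithm of the paper; the class name `ExactMoment` and its methods are hypothetical.

```python
from collections import defaultdict


class ExactMoment:
    """Exact p-th frequency moment of a vector under streaming updates.

    Keeps every touched coordinate in memory, so space grows with the
    vector. The paper's point is to approximate the same number in far
    less space; this class only pins down what is being estimated.
    """

    def __init__(self, p):
        self.p = p
        self.x = defaultdict(float)

    def update(self, index, delta):
        """Apply the stream update x[index] += delta."""
        self.x[index] += delta

    def moment(self):
        """Return F_p = sum over i of |x_i| ** p for the current vector."""
        return sum(abs(v) ** self.p for v in self.x.values() if v != 0)
```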
Towards better analytical software | (Articles about R)
july 2010 by Vaguery
"Here are some thoughts on using existing statistical software for better analytics and/or business intelligence (reporting)…"
user-experience
software-development
business-opportunity
business-model
analytics
data-analysis
july 2010 by Vaguery
Environment for DeveLoping KDD-Applications Supported by Index-Structures - Wikipedia, the free encyclopedia
july 2010 by Vaguery
"Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) is a Knowledge Discovery in Databases (KDD, "data mining") software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures."
clustering
algorithms
libraries
data-analysis
exploratory-data-analysis
statistics
nudge
july 2010 by Vaguery
[1006.5273] Linear Detrending Subsequence Matching in Time-Series Databases
june 2010 by Vaguery
"Each time-series has its own linear trend, the directionality of a timeseries, and removing the linear trend is crucial to get the more intuitive matching results. Supporting the linear detrending in subsequence matching is a challenging problem due to a huge number of possible subsequences. In this paper we define this problem the linear detrending subsequence matching and propose its efficient index-based solution. To this end, we first present a notion of LD-windows (LD means linear detrending), which is obtained as follows: we eliminate the linear trend from a subsequence rather than each window itself and obtain LD-windows by dividing the subsequence into windows. Using the LD-windows we then present a lower bounding theorem for the index-based matching solution and formally prove its correctness.…"
time-series
data-mining
data-analysis
prediction
statistics
nudge-targets
june 2010 by Vaguery
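The basic operation behind the entry above, removing a linear trend from one subsequence, can be sketched as an ordinary least-squares fit followed by a subtraction. This is only the per-subsequence step under the assumption of evenly spaced samples; the paper's LD-windows and index-based matching are not reproduced, and the name `linear_detrend` is hypothetical.

```python
def linear_detrend(values):
    """Subtract the least-squares straight line from a sequence (sketch)."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    # Slope of the ordinary least-squares fit against t = 0, 1, ..., n - 1.
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    slope = num / den
    intercept = mean_v - slope * mean_t
    return [v - (slope * t + intercept) for t, v in enumerate(values)]
```

Two subsequences that differ only by a straight line detrend to the same residual, which is why the step makes matching results more intuitive.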
[1006.4330] Large gaps imputation in remote sensed imagery of the environment
june 2010 by Vaguery
"Imputation of missing data in large regions of satellite imagery is necessary when the acquired image has been damaged by shadows due to clouds, or information gaps produced by sensor failure.
The general approach for imputation of missing data, that could not be considered missed at random, suggests the use of other available data. Previous work, like local linear histogram matching, take advantage of a co-registered older image obtained by the same sensor, yielding good results in filling homogeneous regions, but poor results if the scenes being combined have radical differences in target radiance due, for example, to the presence of sun glint or snow.…"
nudge-targets
definitely-nudge-targets
imputation
statistics
machine-learning
data-analysis
june 2010 by Vaguery
Protovis 3.2 released – more examples and layouts
june 2010 by Vaguery
"The most recent version of Protovis, the open-source visualization library that uses JavaScript and SVG, was just released not too long ago - this time with more layout and examples. This is especially helpful since Protovis was "designed to be learned by example." Among the new stuff is the ever popular streamgraphs, along with the force-directed layout. With only 10 to 20 lines of code, you'll have your viz, so lots of bang for the buck."
graphs
visualization
data-analysis
javascript
library
protovis
nudge
june 2010 by Vaguery
What is data science? - O'Reilly Radar
june 2010 by Vaguery
"We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?
In this post, I examine the many sides of data science -- the technologies, the companies and the unique skill sets."
data-analysis
data-mining
learning-from-data
statistics
futurism
drinking-from-the-firehose
nudge
via:tsuomela
june 2010 by Vaguery
Lee Byron » Else » Stream Graph Paper
may 2010 by Vaguery
"In February 2008, the New York Times published an unusual chart of box office revenues for 7500 movies over 21 years. The chart was based on a similar visualization, developed by the first author, that displayed trends in music listening. This paper describes the design decisions and algorithms behind these graphics, and discusses the reaction on the Web. We suggest that this type of complex layered graph is effective for displaying large data sets to a mass audience. We provide a mathematical analysis of how this layered graph relates to traditional stacked graphs and to techniques such as ThemeRiver, showing how each method is optimizing a different “energy function”. Finally, we discuss techniques for coloring and ordering the layers of such graphs. Throughout the paper, we emphasize the interplay between considerations of aesthetics and legibility."
visualization
dataviz
data-analysis
time-series
learning-from-data
answer-factory
may 2010 by Vaguery
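The relation between traditional stacked graphs and ThemeRiver mentioned in the abstract above comes down to the choice of baseline under the first layer. The sketch below shows the two simplest choices as they are commonly described: a flat baseline for a stacked graph, and a baseline of minus half the total for a ThemeRiver-style layout that is symmetric about the axis. The function name and the `layout` strings are assumptions, and the streamgraph's own baseline is not implemented here.

```python
def layer_boundaries(layers, layout="stacked"):
    """Return the boundary curves g_0..g_n for a layered graph (sketch).

    `layers` is a list of equal-length lists of non-negative values.
    "stacked" uses a flat baseline g_0 = 0; "themeriver" centers the stack
    on the axis with g_0 = -0.5 * (sum of all layers) at each point.
    """
    n_points = len(layers[0])
    totals = [sum(layer[t] for layer in layers) for t in range(n_points)]
    if layout == "stacked":
        baseline = [0.0] * n_points
    elif layout == "themeriver":
        baseline = [-0.5 * total for total in totals]
    else:
        raise ValueError(f"unknown layout: {layout}")
    boundaries = [baseline]
    for layer in layers:
        previous = boundaries[-1]
        boundaries.append([previous[t] + layer[t] for t in range(n_points)])
    return boundaries
```

Layer i is drawn between boundary i and boundary i + 1, so only the baseline differs between the two layouts.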
Streamgraph code ported to JavaScript
may 2010 by Vaguery
"Lee Byron open-sourced his streamgraph code in Processing about a month ago. Jason Sundram has taken that and ported it to JavaScript, using Processing.js.
The algorithms are the same as those in the original, but of course the natural benefit is that people don't need Java to run it in their browsers. Jason has also added a few features including dynamic sizing, more straightforward settings, and some interaction with zoom and hover control. Really nice work."
visualization
graphic-design
processing.js
library
graphing
data-analysis
dataviz
may 2010 by Vaguery
Think like a statistician – without the math | FlowingData
march 2010 by Vaguery
"Ask Why
Finally, and this is the most important thing I've learned, always ask why. When you see a blip in a graph, you should wonder why it's there. If you find some correlation, you should think about whether or not it makes any sense. If it does make sense, then cool, but if not, dig deeper. Numbers are great, but you have to remember that when humans are involved, errors are always a possibility."
statistics
pragmatism
data-analysis
modeling-is-not-mathematics
march 2010 by Vaguery
News — PyMVPA Home
march 2010 by Vaguery
"PyMVPA is a Python module intended to ease pattern classification analyses of large datasets. In the neuroimaging contexts such analysis techniques are also known as decoding or MVPA analysis. PyMVPA provides high-level abstraction of typical processing steps and a number of implementations of some popular algorithms. While it is not limited to the neuroimaging domain, it is eminently suited for such datasets. PyMVPA is truly free software (in every respect) and additionally requires nothing but free-software to run."
data-analysis
Python
machine-learning
open-source
free
visualization
statistics
exploratory-data-analysis
march 2010 by Vaguery
Listing Recent Prices for EC2 Spot Instances - Alestic.com
december 2009 by Vaguery
"The best way to approach auction type situations like this is often to simply list the maximum price you can afford. Your instance(s) will get run if and when the spot instance price reaches that price and you will regularly get charged less depending on what other users are bidding for their instances.
Though I don’t recommend trying to chase the spot instance price around, it is natural to be curious about what others have been paying and whether or not you might have a chance to get in with your bid."
spot-pricing
Amazon
economics
auction
pricing
EC2
data-analysis
december 2009 by Vaguery
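The bidding advice in the entry above can be illustrated with a small simulation. It rests on a deliberately simplified rule that is an assumption, not a statement of EC2's actual billing: the instance runs in every hour whose spot price is at or below your maximum bid, and that hour is charged at the spot price rather than at the bid. The function name and the prices in the usage note are hypothetical.

```python
def simulate_spot(hourly_prices, max_bid):
    """Return (hours_run, total_cost) under a simplified spot-market rule.

    Assumption: the instance runs in every hour whose spot price is at or
    below `max_bid`, and that hour is billed at the spot price, not the bid.
    """
    hours_run = 0
    total_cost = 0.0
    for price in hourly_prices:
        if price <= max_bid:
            hours_run += 1
            total_cost += price
    return hours_run, total_cost
```

With made-up prices of 0.03, 0.05 and 0.04 per hour and a maximum bid of 0.04, the instance runs for two of the three hours and pays the spot price for each, which is never more than the bid.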
ggplot. had.co.nz
november 2009 by Vaguery
"ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics."
visualization
data-analysis
exploratory-data-analysis
statistics
graphics
graphs
pretty
software
open-source
documentation
ggplot2
R
november 2009 by Vaguery
The White House Doesn't Represent America ('s Surname First Letters)
november 2009 by Vaguery
"The thought of actually looking up who these people are, what they do, and why they might be visiting was excruciatingly boring, so I didn't do that. Instead, I looked at the distribution of first surname letters of the people who visited the White House, and then I compared that distribution to the actual frequency of the same first surname letters in the U.S. writ large."
data-analysis
politics
hypotheses
representativeness
charts
november 2009 by Vaguery
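The comparison described in the entry above, the distribution of first surname letters in one list against a reference distribution, can be sketched as follows. The function names are hypothetical and no real visitor or census data is included; the inputs are whatever name lists you supply.

```python
from collections import Counter


def first_letter_shares(surnames):
    """Map each initial letter to its share of the given surnames."""
    counts = Counter(name.strip()[0].upper() for name in surnames if name.strip())
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}


def share_gaps(sample, reference):
    """Per-letter difference: sample share minus reference share."""
    letters = set(sample) | set(reference)
    return {
        letter: sample.get(letter, 0.0) - reference.get(letter, 0.0)
        for letter in sorted(letters)
    }
```

A positive gap means the letter is over-represented in the sample relative to the reference, a negative gap the opposite.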
Black Swans Don’t Kill People, Black Swan Dealers Kill People « The Emergent Fool
october 2009 by Vaguery
"Decisions: The first type of decisions is simple, “binary”, i.e. you just care if something is true or false. Very true or very false does not matter. Someone is either pregnant or not pregnant. A statement is “true” or “false” with some confidence interval. (I call these M0 as, more technically, they depend on the zeroth moment, namely just on probability of events, and not their magnitude —you just care about “raw” probability). A biological experiment in the laboratory or a bet with a friend about the outcome of a soccer game belong to this category.
The second type of decisions is more complex. You do not just care of the frequency—but of the impact as well, or, even more complex, some function of the impact. So there is another layer of uncertainty of impact. (I call these M1+, as they depend on higher moments of the distribution). When you invest you do not care how many times you make or lose, you care about the expectation..."
economics
models
black-swans
storytelling
decision-making
decision-support
data-analysis
october 2009 by Vaguery
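Note on the M0 / M1+ distinction quoted above: a small sketch, assuming a made-up bet with two outcomes, of how the "raw" probability of winning and the expectation can point in opposite directions. The numbers are hypothetical and chosen only to make the contrast visible.

```python
# Hypothetical bet: it wins often, but the rare loss is large.
# Each entry is (probability, payoff).
outcomes = [(0.9, 1.0), (0.1, -20.0)]

# M0 view: only the probability of the event "we win" matters.
p_win = sum(p for p, payoff in outcomes if payoff > 0)

# M1 view: the expectation weighs each payoff by its probability.
expectation = sum(p * payoff for p, payoff in outcomes)
```

Here the bet wins nine times in ten, yet its expectation is about -1.1 per play, which is the case the quoted passage describes when it says an investor cares about the expectation and not about how many times they win or lose.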
Collecta Releases its Real Time API – issues challenge! « AltSearchEngines
september 2009 by Vaguery
"In conjunction with the API release, Collecta is launching a developer’s challenge with ChallengePost.com.
Dubbed “The AppMaster Challenge,” the contest will help drive the development of creative and powerful applications. From now through October 8th, developers can submit their Collecta-powered plug-in, webapp or application and the Collecta team will select the one that best exemplifies what real-time results can do. The winner will be announced on October 15th, and will receive both a featured spot as AppMaster Champion and a new 15″ MacBook Pro. There will be weekly prizes as well, and developers are encouraged to submit early and often."
search-engines
data
data-analysis
data-aggregation
competition
programming
september 2009 by Vaguery
weather ring td on Flickr - Photo Sharing!
september 2009 by Vaguery
"3D print of a dataform based on 365 days of Canberra weather data (July 08 - June 09). Daily minimum and maximum temperature generate the profile of the outer edge; the holes show rainfall per week. Model generated with Processing, boolean operation in Blender, cleaned in Meshlab, printed by Shapeways. I'll be showing this piece in the Beginning, Middle, End exhibition at ANU School of Art Gallery, 18-24 September"
Processing
fabrication
generative-art
data-analysis
makers
want
september 2009 by Vaguery
Portfolio Safeguard: Optimization Risk Management VaR CVaR Drawdown Omega Replication Hedging Tracking Credit Risk Minimization MATLAB
august 2009 by Vaguery
Don't give a damn about the software. The benchmarks are interesting.
portfolio-theory
financial-engineering
benchmarking
test-cases
data-analysis
nudge
august 2009 by Vaguery
Gene Expression: The geography of online social networks
may 2009 by Vaguery
"If Facebook were being used to talk anonymously to a bunch of strangers, as with the early AOL chatrooms, then the adoption of this technology wouldn't show such a strong geographical pattern -- who cares if no one else in your state uses a chatroom, as long as there are enough people in total? This shows how firmly grounded in people's real lives their use of Facebook is; otherwise it would not spread in a more or less person-to-person fashion from its founding location."
geography
social-networks
Facebook
data-analysis
networks
may 2009 by Vaguery
Katya Vladislavleva - Tilburg University
may 2009 by Vaguery
See in particular Chapter 2, on Data Balancing. This is important stuff for those of us dealing with data-driven models and techniques, especially those not based on analytical closed form first-principles junk.
genetic-programming
modeling
data-analysis
learning-from-data
machine-learning
thesis
techniques
numerical-models
may 2009 by Vaguery
Infochimps.org: Free Redistributable Data Sets of Every Kind
april 2009 by Vaguery
"There are many sources to find out something about everything. Until now, there’s been no good place for you to find out everything about something.
The infochimps.org community is assembling and interconnecting the world's best repository for raw data -- a sort of giant free allmanac, with tables on everything you can put in a table. Built by data nerds, used by data nerds, it's a central source for the information you need to power the projects the world needs. (learn more: help|faq)"
data
data-analysis
openness
open-science
public-domain
information
visualization
archive
database
free
raw-data-now
april 2009 by Vaguery
Ad Hoc Data Analysis From The Unix Command Line - Wikibooks, collection of open-content textbooks
march 2009 by Vaguery
"Once upon a time, I was working with a colleague who needed to do some quick data analysis to get a handle on the scope of a problem. He was considering importing the data into a database or writing a program to parse and summarize that data. Either of these options would have taken hours at least, and possibly days. I wrote this on his whiteboard:
Your friends: cat, find, grep, wc, cut, sort, uniq
These simple commands can be combined to quickly answer the kinds of questions for which most people would turn to a database, if only the data were already in a database. You can quickly (often in seconds) form and test hypotheses about virtually any record oriented data source."
programming
Unix
command-line
tools
data-analysis
advice
march 2009 by Vaguery
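Note on the command-line entry above: the quoted text names the tools but not a specific pipeline. One common combination is cut, sort, and uniq, used to count how often each value appears in a column. The sketch below does roughly the same thing in Python, assuming tab-delimited records; the function name and the sample records are hypothetical.

```python
from collections import Counter


def count_field(lines, field, delimiter="\t"):
    """Count the values of one delimited column, most frequent first.

    Roughly what `cut -f N | sort | uniq -c | sort -rn` does in a shell.
    `field` is zero-based here, unlike cut's one-based numbering.
    """
    values = []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        if field < len(parts):  # skip records that are too short
            values.append(parts[field])
    return Counter(values).most_common()


# Hypothetical log records: status code, then path.
records = ["200\t/index", "404\t/missing", "200\t/about"]
top = count_field(records, 0)
```

The shell version is usually the quicker choice for a one-off question; the Python form is easier to extend once the question grows beyond a single column.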
The Commoditization of Massive Data Analysis - O'Reilly Radar
december 2008 by Vaguery
"We are at the beginning of what I call The Industrial Revolution of Data. We're not quite there yet, since most of the digital information available today is still individually "handmade": prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation "factories" such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds. These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide. Meanwhile, disk capacities are growing exponentially, so the cost of archiving this data remains modest. And there are plenty of reasons to believe that this data has value in a wide variety of settings. The last step of the revolution is the commoditization of data analysis software, to serve a broad class of users."
data-analysis
analytics
business-models
trends
scalability
MapReduce
data-driven
economics
december 2008 by Vaguery
Socializing the analysis of the socialization of banking « Jon Udell
september 2008 by Vaguery
"When Allen Noren pointed to this visualization of U.S. government bailouts, I wanted to tweak it by showing the magnitudes on a timeline. I found this data set on Many Eyes, updated it with the number $700B, and made this bubble chart:..."
visualization
graphics
online
tools
collaboration
crowdsourcing
data-analysis
knowledge
management
explanation
proposal
september 2008 by Vaguery
R functions for time-series analysis
july 2008 by Vaguery
Noted for reference in Nudge project
via:arsyed
nudge
time-series
models
data-analysis
statistics
R
july 2008 by Vaguery
Texture Synthesis Links
march 2008 by Vaguery
Various potentially useful resources for texture synthesis and image analysis applications of genetic programming.
resources
library
machine-learning
datasets
data-analysis
data-mining
test-cases
march 2008 by Vaguery
(theinfo)
january 2008 by Vaguery
"This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange tips and tricks, dev
via:arthegall
algorithms
analytics
collaboration
collection
data
data-analysis
data-mining
hacking
open
research
tools
january 2008 by Vaguery