Vaguery + data-analysis   68

Visualization series: Insight from Cleveland and Tufte on plotting numeric data by groups | Solomon Messing
"A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below).  Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points.  Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive."
visualization  data-analysis  communication  graphic-design  argumentation  statistics  ggplot2 
11 weeks ago by Vaguery
[1202.0077] An Interacting Particle Model for Clustering Euclidean Datasets
"In this paper we propose a method based on interacting particle physics, devised for clustering Euclidean datasets without initial constraints or conditions. We model any dataset as an interacting particle system, whose elements correspond to particles that interact through a simplified version of Lennard-Jones potentials. In so doing, mutual attractive interactions allow to identify groups of proximal particles. The main outcome of this modeling task is an adjacency matrix, taken as input by a community detection algorithm aimed to identify different partitions. The underlying conjecture is that, using a multiresolution analysis, the adopted model allows to find the right number of clusters for any given dataset. Experimental results, performed in comparison with a classical clustering algorithm, confirm this assumption."
clustering  data-analysis  algorithms  nudge-targets  distributed-processing 
february 2012 by Vaguery
[1201.5568] Dynamic trees for streaming and massive data contexts
"Data collection at a massive scale is becoming ubiquitous in a wide variety of settings, from vast offline databases to streaming real-time information. Learning algorithms deployed in such contexts must rely on single-pass inference, where the data history is never revisited. In streaming contexts, learning must also be temporally adaptive to remain up-to-date against unforeseen changes in the data generating mechanism. Although rapidly growing, the online Bayesian inference literature remains challenged by massive data and transient, evolving data streams. Non-parametric modelling techniques can prove particularly ill-suited, as the complexity of the model is allowed to increase with the sample size. In this work, we take steps to overcome these challenges by porting standard streaming techniques, like data discarding and downweighting, into a fully Bayesian framework via the use of informative priors and active learning heuristics. We showcase our methods by augmenting a modern non-parametric modelling framework, dynamic trees, and illustrate its performance on a number of practical examples. The end product is a powerful streaming regression and classification tool, whose performance compares favourably to the state-of-the-art."
data-analysis  learning-from-data  algorithms  drinking-from-the-firehose  nudge  data-mining 
january 2012 by Vaguery
[1112.2316] Complexity-entropy causality plane: a useful approach for distinguishing songs
"Nowadays we are often faced with huge databases resulting from the rapid growth of data storage technologies. This is particularly true when dealing with music databases. In this context, it is essential to have techniques and tools able to discriminate properties from these massive sets. In this work, we report on a statistical analysis of more than ten thousand songs aiming to obtain a complexity hierarchy. Our approach is based on the estimation of the permutation entropy combined with an intensive complexity measure, building up the complexity-entropy causality plane. The results obtained indicate that this representation space is very promising to discriminate songs as well as to allow a relative quantitative comparison among songs. Additionally, we believe that the here-reported method may be applied in practical situations since it is simple, robust and has a fast numerical implementation."
signal-processing  classification  data-analysis  clustering  representation  music  nudge-targets 
january 2012 by Vaguery
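A note on the entry above: the abstract builds on permutation entropy, which can be sketched in a few lines. This is a minimal illustration of the standard ordinal-pattern (Bandt-Pompe) definition, not the authors' implementation. The function name, the default order, and the tie-breaking rule (ties ranked by position) are assumptions made here, and the complexity measure that forms the other axis of the plane is not covered.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3):
    """Normalized permutation entropy of a numeric sequence, in [0, 1].

    Assumption: ties inside a window are broken by position, because
    Python's sort is stable. The abstract does not say how ties are handled.
    """
    if order < 2 or len(series) < order:
        raise ValueError("need order >= 2 and at least `order` samples")

    patterns = Counter()
    for start in range(len(series) - order + 1):
        window = series[start:start + order]
        # Ordinal pattern: the positions that would sort this window.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        patterns[pattern] += 1

    total = sum(patterns.values())
    entropy = -sum((n / total) * math.log(n / total) for n in patterns.values())
    # Divide by log(order!) so a uniform pattern distribution scores 1.
    return entropy / math.log(math.factorial(order))
```

A strictly increasing series produces a single ordinal pattern and scores 0; a series whose patterns are spread evenly scores close to 1.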
Classifying Heart Sounds Challenge
"According to the World Health Organisation, cardiovascular diseases (CVDs) are the number one cause of death globally: more people die annually from CVDs than from any other cause. An estimated 17.1 million people died from CVDs in 2004, representing 29% of all global deaths. Of these deaths, an estimated 7.2 million were due to coronary heart disease. Any method which can help to detect signs of heart disease could therefore have a significant impact on world health. This challenge is to produce methods to do exactly that. Specifically, we are interested in creating the first level of screening of cardiac pathologies both in a Hospital environment by a doctor (using a digital stethoscope) and at home by the patient (using a mobile device).

The problem is of particular interest to machine learning researchers as it involves classification of audio sample data, where distinguishing between classes of interest is non-trivial. Data is gathered in real-world situations and frequently contains background noise of every conceivable type. The differences between heart sounds corresponding to different heart symptoms can also be extremely subtle and challenging to separate. Success in classifying this form of data requires extremely robust classifiers. Despite its medical significance, to date this is a relatively unexplored application for machine learning."
machine-learning  competition  nudge-targets  classification  segmentation  data-analysis  supervised-learning 
november 2011 by Vaguery
[1105.4953] A fast nearest neighbor search algorithm based on vector quantization
"In this article, we propose a new fast nearest neighbor search algorithm, based on vector quantization. Like many other branch and bound search algorithms [1,10], a preprocessing recursively partitions the data set into disjointed subsets until the number of points in each part is small enough. In doing so, a search-tree data structure is built. This preliminary recursive data-set partition is based on the vector quantization of the empirical distribution of the initial data-set. Unlike previously cited methods, this kind of partitions does not a priori allow to eliminate several brother nodes in the search tree with a single test. To overcome this difficulty, we propose an algorithm to reduce the number of tested brother nodes to a minimal list that we call "friend Voronoi cells". The complete description of the method requires a deeper insight into the properties of Delaunay triangulations and Voronoi diagrams"
algorithms  search-algorithms  data-analysis  nudge-targets 
october 2011 by Vaguery
Datameer snags $9.25M more to analyze massive amounts of data | VentureBeat
"Datameer, a company that allows users to analyze massive amounts of data without technical know-how, today announced a second round of funding for $9.25 million. The money will be used to hire additional employees for its engineering, sales, and marketing teams."
data-analysis  data-mining  startups  funding  bubblicious 
june 2011 by Vaguery
"Big Memory" Company Terracotta Snapped Up by Europe's Fourth Largest Software Company
"In-memory is a hot topic right now, thanks in part to SAP pushing its in-memory analytics platform HANA at Sapphire last week. HANA, however, is not a direct competitor to BigMemory. According to RedMonk co-founder James Governor, competitors include Oracle Coherence, IBM eXtreme Scale, Hazelcast and Gigaspaces.

"Indeed distributed cache is well known enough to be seen as a 'competitor' to NoSQL approaches," Governor wrote. "Both take load off the database - less database work generally means greater scalability""
software-architecture  distributed-processing  data-analysis  database  open-source 
may 2011 by Vaguery
[0807.1271] Semiparametric curve alignment and shift density estimation for biological data
"Assume that we observe a large number of curves, all of them with identical, although unknown, shape, but with a different random shift. The objective is to estimate the individual time shifts and their distribution. Such an objective appears in several biological applications like neuroscience or ECG signal processing, in which the estimation of the distribution of the elapsed time between repetitive pulses with a possibly low signal-noise ratio, and without a knowledge of the pulse shape is of interest. We suggest an M-estimator leading to a three-stage algorithm: we split our data set in blocks, on which the estimation of the shifts is done by minimizing a cost criterion based on a functional of the periodogram; the estimated shifts are then plugged into a standard density estimator. We show that under mild regularity assumptions the density estimate converges weakly to the true shift distribution. The theory is applied both to simulations and to alignment of real ECG signals.…"
data-analysis  statistics  algorithms  heuristics  exploratory-data-analysis  nudge  optimization  classification  time-series 
august 2010 by Vaguery
How did Weather Data Get Opened? - A Healthy Information Diet - InfoVegan.com
"Weather data didn’t come to be because of an Open Government Directive. It wasn’t created because of a White House mandate. Government did not release the data and then enterprising people built companies on top of it. It’s more accurate to make the argument that we have a national weather service because of one man’s deep desire to keep his job and to get promoted to colonel in the Army. It could be a vast network of lobbyists to help that man get promoted, or the vast network of lobbyists from shipping companies trying to get access to data already being created. Or it could be that it was just pretty obvious that access to weather data would save lives."
weather  open-access  data-analysis  big-data-will-lead-to-big-inference  public-policy  marketing 
august 2010 by Vaguery
[1008.1758] Stochastic Data Clustering
"In 1961 Herbert Simon and Albert Ando published the theory behind the long-term behavior of a dynamical system that can be described by a nearly completely decomposable matrix. Over the past fifty years this theory has been used in a variety of contexts, including queueing theory, computer performance, and ecology. In all these applications, the structure of the system is known and the point of interest is the various states the system passes through on its way to some long-term equilibrium. This paper looks at this problem from the other direction. That is, we develop a technique for using the evolution of the system to tell us about its initial structure, and we use this technique to develop a new algorithm for data clustering."
clustering  data-analysis  exploratory-data-analysis  statistics  algorithms 
august 2010 by Vaguery
Nanex - Market Crop Circle Of The Day
"As we continue to monitor the markets for evidence of Quote Stuffing and Strange Sequences (Crop Circles), we find that there are dozens if not hundreds of examples to choose from on any given day. As such, this page will be updated often with charts demonstrating this activity.

The common theme with the charts shown on this page is they are obviously all generated in code and are algorithmic. Some demonstrate bizarre price or size cycling, some demonstrate large burst of quotes in extremely short time frames and some will demonstrate both. In most cases these sequences are from a single exchange with no other exchange quoting in the same time frame."
machine-learning  trading  financial-engineering  skynet  data-analysis  emergent-design  technical-analysis  behavioral-finance 
august 2010 by Vaguery
Flash Crash Analysis - May 6'th 2010 - Part 4 - Nanex
"While analyzing HFT (High Frequency Trading) quote counts, we were shocked to find cases where one exchange was sending an extremely high number of quotes for one stock in a single second: as high as 5,000 quotes in 1 second! During May 6, there were hundreds of times that a single stock had over 1,000 quotes from one exchange in a single second. Even more disturbing, there doesn't seem to be any economic justification for this. In many of the cases, the bid/offer is well outside the National Best Bid/Offer (NBBO). We decided to analyze a handful of these cases in detail and graphed the sequential bid/offers to better understand them. What we discovered was a manipulative device with destabilizing effect."
trading  financial-systems  design-automation  complex-systems  emergent-design  engineering  data-analysis  skynet 
august 2010 by Vaguery
[1006.4531] Generalised network clustering and its dynamical implications
"A parameterisation of generalised network clustering, in the form of four-motif prevalences, is presented. This involves three real parameters that are conditional on one- two- and three-motif prevalences. Interpretations of these real parameters are presented that motivate a set of rewiring schemes to create appropriately clustered networks. Finally, the dynamical implications of higher order structure, as parameterised, for a contact process are considered."
clustering  network-theory  complexology  nudge-targets  algorithms  data-analysis  comparison 
august 2010 by Vaguery
[1005.5141] Constructing Positive Definite Elastic Kernels with Application to Time Series Classification
"This paper proposes some extensions to the work on kernels dedicated to string alignment (biological sequence alignment) based on the summing up of scores obtained by local alignments with gaps. The extensions we propose allow to construct, from classical time-warp distances, what we called summative time-warp kernels that are positive definite if some simple sufficient conditions are satisfied. Furthermore, from the same formalism, we derive a time-warp inner product that extends the usual euclidean inner product, providing the capability to handle discrete sequences or time series of variable lengths in an Hilbert space. The classification experiment we conducted, using either first near neighbor classifier or Support Vector Machine classifier leads to conclude that the positive definite elastic kernels we propose outperform the distance substituting kernels for the classical elastic distances we tested.…"
time-series  data-analysis  nudge-targets  classification  machine-learning  algorithms 
august 2010 by Vaguery
[1007.1075] Clustering Stability: An Overview
"A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications."
statistics  data-analysis  clustering  nonparametric-statistics  exploratory-data-analysis  heuristics 
august 2010 by Vaguery
[0903.5066] Modified-CS: Modifying Compressive Sensing for Problems with Partially Known Support
"We study the problem of reconstructing a sparse signal from a limited number of its linear projections when a part of its support is known, although the known part may contain some errors. The ``known" part of the support, denoted T, may be available from prior knowledge. Alternatively, in a problem of recursively reconstructing time sequences of sparse spatial signals, one may use the support estimate from the previous time instant as the ``known" part. The idea of our proposed solution (modified-CS) is to solve a convex relaxation of the following problem: find the signal that satisfies the data constraint and is sparsest outside of T.…"
compressed-sensing  algorithms  machine-learning  statistics  signal-processing  nudge-targets  data-analysis 
july 2010 by Vaguery
[1007.4748] Detecting influenza outbreaks by analyzing Twitter messages
"We analyze over 500 million Twitter messages from an eight month period and find that tracking a small number of flu-related keywords allows us to forecast future influenza rates with high accuracy, obtaining a 95% correlation with national health statistics. We then analyze the robustness of this approach to spurious keyword matches, and we propose a document classification component to filter these misleading messages. We find that this document classifier can reduce error rates by over half in simulated false alarm experiments, though more research is needed to develop methods that are robust in cases of extremely high noise."
epidemiology  twitter  social-media  data-analysis  public-health  big-data-will-lead-to-big-inference 
july 2010 by Vaguery
[1007.3799] Adapting to the Shifting Intent of Search Queries
"Search engines today present results that are often oblivious to abrupt shifts in intent. For example, the query `independence day' usually refers to a US holiday, but the intent of this query abruptly changed during the release of a major film by that name. … This paper shows that the signals a search engine receives can be used to both determine that a shift in intent has happened, as well as find a result that is now more relevant. We present a meta-algorithm that marries a classifier with a bandit algorithm to achieve regret that depends logarithmically on the number of query impressions, under certain assumptions. We provide strong evidence that this regret is close to the best achievable. Finally, via a series of experiments, we demonstrate that our algorithm outperforms prior approaches, particularly as the amount of intent-shifting traffic increases."
search-engines  search-algorithms  machine-learning  social-dynamics  algorithms  nudge-targets  intelligence-gathering  data-analysis 
july 2010 by Vaguery
[1007.4191] Fast Moment Estimation in Data Streams in Optimal Space
"We give a space-optimal algorithm with update time O(log^2(1/eps)loglog(1/eps)) for (1+eps)-approximating the pth frequency moment, 0 < p < 2, of a length-n vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time Omega(1/eps^2)."
nudge-targets  algorithms  data-analysis  online-learning  machine-learning  computational-complexity  statistics 
july 2010 by Vaguery
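A note on the entry above: the quantity being approximated, the pth frequency moment of a vector updated in a stream, is easy to state exactly. The sketch below computes it the naive way, storing the whole vector, which is the space cost the paper's algorithm is designed to avoid. It is a reference definition, not the paper's method; the function name and the (index, delta) update format are assumptions made here.

```python
from collections import defaultdict


def frequency_moment(updates, p):
    """Exact pth frequency moment: the sum of |x_i| ** p over the final vector.

    `updates` is an iterable of (index, delta) pairs applied to a vector that
    starts at zero. This keeps every coordinate in memory, so it uses linear
    space rather than the small space a streaming algorithm would use.
    """
    vector = defaultdict(float)
    for index, delta in updates:
        vector[index] += delta
    return sum(abs(value) ** p for value in vector.values())
```

An approximate streaming estimate can be checked against this exact value on small inputs.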
Towards better analytical software | (Articles about R)
"Here are some thoughts on using existing statistical software for better analytics and/or business intelligence (reporting)…"
user-experience  software-development  business-opportunity  business-model  analytics  data-analysis 
july 2010 by Vaguery
Environment for DeveLoping KDD-Applications Supported by Index-Structures - Wikipedia, the free encyclopedia
"Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) is a Knowledge Discovery in Databases (KDD, "data mining") software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures."
clustering  algorithms  libraries  data-analysis  exploratory-data-analysis  statistics  nudge 
july 2010 by Vaguery
[1006.5273] Linear Detrending Subsequence Matching in Time-Series Databases
"Each time-series has its own linear trend, the directionality of a timeseries, and removing the linear trend is crucial to get the more intuitive matching results. Supporting the linear detrending in subsequence matching is a challenging problem due to a huge number of possible subsequences. In this paper we define this problem the linear detrending subsequence matching and propose its efficient index-based solution. To this end, we first present a notion of LD-windows (LD means linear detrending), which is obtained as follows: we eliminate the linear trend from a subsequence rather than each window itself and obtain LD-windows by dividing the subsequence into windows. Using the LD-windows we then present a lower bounding theorem for the index-based matching solution and formally prove its correctness.…"
time-series  data-mining  data-analysis  prediction  statistics  nudge-targets 
june 2010 by Vaguery
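A note on the entry above: the detrending step itself, removing a least-squares line from a subsequence, can be sketched as follows. This covers only that step, not the LD-windows or the index-based matching the abstract describes. The function name and the choice of sample positions 0, 1, 2, ... as the x-axis are assumptions made here.

```python
def linear_detrend(values):
    """Subtract the least-squares line from a sequence of numbers.

    Assumption: samples are evenly spaced, so positions 0..n-1 serve as x.
    """
    n = len(values)
    if n < 2:
        return [0.0] * n  # a single point has no trend to remove

    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in enumerate(values)]
```

Two subsequences with the same shape but different linear trends detrend to the same residual, which is why the abstract calls trend removal crucial for intuitive matches.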
[1006.4330] Large gaps imputation in remote sensed imagery of the environment
"Imputation of missing data in large regions of satellite imagery is necessary when the acquired image has been damaged by shadows due to clouds, or information gaps produced by sensor failure.
The general approach for imputation of missing data, that could not be considered missed at random, suggests the use of other available data. Previous work, like local linear histogram matching, take advantage of a co-registered older image obtained by the same sensor, yielding good results in filling homogeneous regions, but poor results if the scenes being combined have radical differences in target radiance due, for example, to the presence of sun glint or snow.…"
nudge-targets  definitely-nudge-targets  imputation  statistics  machine-learning  data-analysis 
june 2010 by Vaguery
Protovis 3.2 released – more examples and layouts
"The most recent version of Protovis, the open-source visualization library that uses JavaScript and SVG, was just released not too long ago - this time with more layout and examples. This is especially helpful since Protovis was "designed to be learned by example." Among the new stuff is the ever popular streamgraphs, along with the force-directed layout. With only 10 to 20 lines of code, you'll have your viz, so lots of bang for the buck."
graphs  visualization  data-analysis  javascript  library  protovis  nudge 
june 2010 by Vaguery
What is data science? - O'Reilly Radar
"We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?

In this post, I examine the many sides of data science -- the technologies, the companies and the unique skill sets."
data-analysis  data-mining  learning-from-data  statistics  futurism  drinking-from-the-firehose  nudge  via:tsuomela 
june 2010 by Vaguery
Lee Byron » Else » Stream Graph Paper
"In February 2008, the New York Times published an unusual chart of box office revenues for 7500 movies over 21 years. The chart was based on a similar visualization, developed by the first author, that displayed trends in music listening. This paper describes the design decisions and algorithms behind these graphics, and discusses the reaction on the Web. We suggest that this type of complex layered graph is effective for displaying large data sets to a mass audience. We provide a mathematical analysis of how this layered graph relates to traditional stacked graphs and to techniques such as ThemeRiver, showing how each method is optimizing a different “energy function”. Finally, we discuss techniques for coloring and ordering the layers of such graphs. Throughout the paper, we emphasize the interplay between considerations of aesthetics and legibility."
visualization  dataviz  data-analysis  time-series  learning-from-data  answer-factory 
may 2010 by Vaguery
Streamgraph code ported to JavaScript
"Lee Byron open-sourced his streamgraph code in Processing about a month ago. Jason Sundram has taken that and ported it to JavaScript, using Processing.js.
The algorithms are the same as that in the original, but of course the natural benefit is that people don't need Java to run it in their browsers. Jason has also added a few features including dynamic sizing, more straightforward settings, and some interaction with zoom and hover control. Really nice work."
visualization  graphic-design  processing.js  library  graphing  data-analysis  dataviz 
may 2010 by Vaguery
Think like a statistician – without the math | FlowingData
"Ask Why
Finally, and this is the most important thing I've learned, always ask why. When you see a blip in a graph, you should wonder why it's there. If you find some correlation, you should think about whether or not it makes any sense. If it does make sense, then cool, but if not, dig deeper. Numbers are great, but you have to remember that when humans are involved, errors are always a possibility."
statistics  pragmatism  data-analysis  modeling-is-not-mathematics 
march 2010 by Vaguery
News — PyMVPA Home
"PyMVPA is a Python module intended to ease pattern classification analyses of large datasets. In the neuroimaging contexts such analysis techniques are also known as decoding or MVPA analysis. PyMVPA provides high-level abstraction of typical processing steps and a number of implementations of some popular algorithms. While it is not limited to the neuroimaging domain, it is eminently suited for such datasets. PyMVPA is truly free software (in every respect) and additionally requires nothing but free-software to run."
data-analysis  Python  machine-learning  open-source  free  visualization  statistics  exploratory-data-analysis 
march 2010 by Vaguery
Listing Recent Prices for EC2 Spot Instances - Alestic.com
"The best way to approach auction type situations like this is often to simply list the maximum price you can afford. Your instance(s) will get run if and when the spot instance price reaches that price and you will regularly get charged less depending on what other users are bidding for their instances.

Though I don’t recommend trying to chase the spot instance price around, it is natural to be curious about what others have been paying and whether or not you might have a chance to get in with your bid."
spot-pricing  Amazon  economics  auction  pricing  EC2  data-analysis 
december 2009 by Vaguery
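A note on the entry above: the bidding advice can be illustrated with a toy model. The sketch assumes, as the quote describes, that an instance runs whenever the spot price is at or below your maximum and that you are charged the spot price rather than your bid. The hourly granularity and the function name are simplifications made here, not EC2's actual billing rules.

```python
def simulate_spot_bid(hourly_prices, max_bid):
    """Return (hours_run, total_cost) under a simplified spot-pricing model.

    Assumptions: one price per hour, the instance runs in every hour whose
    price is at or below `max_bid`, and each such hour is billed at the
    spot price for that hour.
    """
    hours_run = 0
    total_cost = 0.0
    for price in hourly_prices:
        if price <= max_bid:
            hours_run += 1
            total_cost += price  # billed at the spot price, not the bid
    return hours_run, total_cost
```

Running this over a list of recent prices with a few candidate bids shows how many hours each bid would have bought and at what cost, without chasing the price around.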
ggplot. had.co.nz
"ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics."
visualization  data-analysis  exploratory-data-analysis  statistics  graphics  graphs  pretty  software  open-source  documentation  ggplot2  R 
november 2009 by Vaguery
The White House Doesn't Represent America ('s Surname First Letters)
"The thought of actually looking up who these people are, what they do, and why they might be visiting was excruciatingly boring, so I didn't do that. Instead, I looked at the distribution of first surname letters of the people who visited the White House, and then I compared that distribution to the actual frequency of the same first surname letters in the U.S. writ large."
data-analysis  politics  hypotheses  representativeness  charts 
november 2009 by Vaguery
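A note on the entry above: the comparison it describes, first-letter shares in one list of surnames against a baseline distribution, is a short counting exercise. This is a sketch of the general idea, not the post's code. The function names are made up here, and any names or baseline shares you feed it are your own inputs, not figures from the post.

```python
from collections import Counter


def first_letter_shares(surnames):
    """Map each first letter (uppercased) to its share of the surname list."""
    letters = [name.strip()[0].upper() for name in surnames if name.strip()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}


def share_differences(observed, baseline):
    """Observed share minus baseline share for every letter in either input."""
    letters = sorted(set(observed) | set(baseline))
    return {
        letter: observed.get(letter, 0.0) - baseline.get(letter, 0.0)
        for letter in letters
    }
```

Positive differences mark letters that are over-represented relative to the baseline; negative ones mark under-represented letters.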
Black Swans Don’t Kill People, Black Swan Dealers Kill People « The Emergent Fool
"Decisions: The first type of decisions is simple, “binary”, i.e. you just care if something is true or false. Very true or very false does not matter. Someone is either pregnant or not pregnant. A statement is “true” or “false” with some confidence interval. (I call these M0 as, more technically, they depend on the zeroth moment, namely just on probability of events, and not their magnitude —you just care about “raw” probability). A biological experiment in the laboratory or a bet with a friend about the outcome of a soccer game belong to this category.

The second type of decisions is more complex. You do not just care of the frequency—but of the impact as well, or, even more complex, some function of the impact. So there is another layer of uncertainty of impact. (I call these M1+, as they depend on higher moments of the distribution). When you invest you do not care how many times you make or lose, you care about the expectation..."
economics  models  black-swans  storytelling  decision-making  decision-support  data-analysis 
october 2009 by Vaguery
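A note on the entry above: the difference between the two decision types can be shown with a small calculation. The first function looks only at the probability of a favourable outcome (the "M0" view); the second weights each outcome by its payoff (the simplest "M1+" view, the expectation). The function names and the (probability, payoff) representation are assumptions made here for illustration.

```python
def win_probability(outcomes):
    """Total probability of the outcomes with a positive payoff."""
    return sum(prob for prob, payoff in outcomes if payoff > 0)


def expected_payoff(outcomes):
    """Probability-weighted average payoff over all outcomes."""
    return sum(prob * payoff for prob, payoff in outcomes)


# Hypothetical bet: a small gain nine times in ten, a large loss otherwise.
bet = [(0.9, 1.0), (0.1, -20.0)]
```

For this made-up bet the win probability is high while the expected payoff is negative, which is the point of the quote: counting how often you win says nothing about how much you stand to lose.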
Collecta Releases its Real Time API – issues challenge! « AltSearchEngines
"In conjunction with the API release, Collecta is launching a developer’s challenge with ChallengePost.com.

Dubbed “The AppMaster Challenge,” the contest will help drive the development of creative and powerful applications. From now through October 8th, developers can submit their Collecta-powered plug-in, webapp or application and the Collecta team will select the one that best exemplifies what real-time results can do. The winner will be announced on October 15th, and will receive both a featured spot as AppMaster Champion and a new 15″ MacBook Pro. There will be weekly prizes as well, and developers are encouraged to submit early and often."
search-engines  data  data-analysis  data-aggregation  competition  programming 
september 2009 by Vaguery
weather ring td on Flickr - Photo Sharing!
"3D print of a dataform based on 365 days of Canberra weather data (July 08 - June 09). Daily minimum and maximum temperature generate the profile of the outer edge; the holes show rainfall per week. Model generated with Processing, boolean operation in Blender, cleaned in Meshlab, printed by Shapeways. I'll be showing this piece in the Beginning, Middle, End exhibition at ANU School of Art Gallery, 18-24 September"
Processing  fabrication  generative-art  data-analysis  makers  want 
september 2009 by Vaguery
Gene Expression: The geography of online social networks
"If Facebook were being used to talk anonymously to a bunch of strangers, as with the early AOL chatrooms, then the adoption of this technology wouldn't show such a strong geographical pattern -- who cares if no one else in your state uses a chatroom, as long as there are enough people in total? This shows how firmly grounded in people's real lives their use of Facebook is; otherwise it would not spread in a more or less person-to-person fashion from its founding location."
geography  social-networks  Facebook  data-analysis  networks 
may 2009 by Vaguery
Katya Vladislavleva - Tilburg University
See in particular Chapter 2, on Data Balancing. This is important stuff for those of us dealing with data-driven models and techniques, especially those not based on analytical closed form first-principles junk.
genetic-programming  modeling  data-analysis  learning-from-data  machine-learning  thesis  techniques  numerical-models 
may 2009 by Vaguery
Infochimps.org: Free Redistributable Data Sets of Every Kind
"There are many sources to find out something about everything. Until now, there’s been no good place for you to find out everything about something.
The infochimps.org community is assembling and interconnecting the world's best repository for raw data -- a sort of giant free allmanac, with tables on everything you can put in a table. Built by data nerds, used by data nerds, it's a central source for the information you need to power the projects the world needs. (learn more: help|faq)"
data  data-analysis  openness  open-science  public-domain  information  visualization  archive  database  free  raw-data-now 
april 2009 by Vaguery
Ad Hoc Data Analysis From The Unix Command Line - Wikibooks, collection of open-content textbooks
"Once upon a time, I was working with a colleague who needed to do some quick data analysis to get a handle on the scope of a problem. He was considering importing the data into a database or writing a program to parse and summarize that data. Either of these options would have taken hours at least, and possibly days. I wrote this on his whiteboard:
Your friends: cat, find, grep, wc, cut, sort, uniq
These simple commands can be combined to quickly answer the kinds of questions for which most people would turn to a database, if only the data were already in a database. You can quickly (often in seconds) form and test hypotheses about virtually any record oriented data source."
programming  Unix  command-line  tools  data-analysis  advice 
march 2009 by Vaguery
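A rough illustration of the whiteboard advice quoted above, not taken from the Wikibook itself: the sketch below builds a small tab-separated log and answers two quick questions with grep, cut, sort, and uniq. The file contents, the column layout, and the variable names (log, index_hits, status_counts) are all assumptions made up for this example.

```shell
#!/bin/sh
# Hypothetical sample data: tab-separated "date<TAB>status<TAB>path" records.
log=$(mktemp)
printf '2009-03-01\t200\t/index\n'    >  "$log"
printf '2009-03-01\t404\t/missing\n'  >> "$log"
printf '2009-03-02\t200\t/index\n'    >> "$log"
printf '2009-03-02\t200\t/about\n'    >> "$log"
printf '2009-03-02\t404\t/missing\n'  >> "$log"

# Question 1: how many records mention /index?
# grep -c prints the number of matching lines.
index_hits=$(grep -c '/index' "$log")

# Question 2: which status codes occur, and how often?
# cut -f2 keeps the second tab-separated column, sort groups equal values,
# uniq -c counts each group, and sort -rn puts the most frequent first.
status_counts=$(cut -f2 "$log" | sort | uniq -c | sort -rn)

echo "records mentioning /index: $index_hits"
echo "status code frequencies:"
echo "$status_counts"

# Remove the temporary file.
rm -f "$log"
```

The same cut | sort | uniq -c | sort -rn pattern works on any column of a record-oriented file, which is why it covers many of the questions people would otherwise load into a database to answer.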
The Commoditization of Massive Data Analysis - O'Reilly Radar
"We are at the beginning of what I call The Industrial Revolution of Data. We're not quite there yet, since most of the digital information available today is still individually "handmade": prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation "factories" such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds. These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide. Meanwhile, disk capacities are growing exponentially, so the cost of archiving this data remains modest. And there are plenty of reasons to believe that this data has value in a wide variety of settings. The last step of the revolution is the commoditization of data analysis software, to serve a broad class of users."
data-analysis  analytics  business-models  trends  scalability  MapReduce  data-driven  economics 
december 2008 by Vaguery
Socializing the analysis of the socialization of banking « Jon Udell
"When Allen Noren pointed to this visualization of U.S. government bailouts, I wanted to tweak it by showing the magnitudes on a timeline. I found this data set on Many Eyes, updated it with the number $700B, and made this bubble chart:..."
visualization  graphics  online  tools  collaboration  crowdsourcing  data-analysis  knowledge  management  explanation  proposal 
september 2008 by Vaguery
Texture Synthesis Links
Various potentially useful resources for texture synthesis and image analysis applications of genetic programming.
resources  library  machine-learning  datasets  data-analysis  data-mining  test-cases 
march 2008 by Vaguery
(theinfo)
"This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange tips and tricks, dev
via:arthegall  algorithms  analytics  collaboration  collection  data  data-analysis  data-mining  hacking  open  research  tools 
january 2008 by Vaguery

