Operations, machine learning and premature babies
7 weeks ago by rahuldave
Julie Steele and I recently had lunch with Etsy's John Allspaw
and Kellan Elliott-McCrea. I'm not sure how we got there, but we made
a connection that was (to me) astonishing between web operations and
medical care for premature infants.
I've written several times about
IBM's work in neonatal intensive care at the University of Toronto.
In any neonatal intensive care unit (NICU), every baby is
connected to dozens of monitors. And each monitor is streaming
hundreds of readings per second into various data systems. They can
generate alerts if anything goes severely out of spec, but in normal
operation, they just generate a summary report for the doctor every
half hour or so.
IBM discovered that by applying machine learning to the full
data stream, they were able to diagnose some dangerous
infections a full day before any symptoms were noticeable to a human.
That's amazing in itself, but what's more important is what they were
looking for. I expected them to be looking for telltale spikes or
irregularities in the readings: perhaps not serious enough to generate
an alarm on their own, but still, the sort of things you'd intuitively
expect of a person about to become ill. But
according to Anjul Bhambhri, IBM's Vice President of Big Data, the
telltale signal wasn't spikes or irregularities, but
the opposite. There's a certain normal variation in heart rate, etc.,
throughout the day, and babies who were about to become sick didn't
exhibit the variation. Their heart rate was too normal; it didn't
change throughout the day as much as it should.
That observation strikes me as revolutionary. It's easy to detect problems
when something goes out of spec: If you have a fever, you know you're
sick. But how do you detect problems that don't set off an alarm?
How many diseases have early symptoms that
are too subtle for a human to notice, and only
accessible to a machine learning system that can sift through
gigabytes of data?
In our conversation, we started wondering how this applied to web
operations. We have gigabytes of data streaming off of our servers,
but the state of system and network monitoring hasn't changed in
years. We look for parameters that are out of spec, thresholds that
are crossed. And that's good for a lot of problems: You need to know
if the number of
packets coming into an interface suddenly goes to zero. But what if the
symptom we should look for is radically different? What if crossing a
threshold isn't what indicates trouble, but the disappearance (or
diminution) of some regular pattern? Is it possible that our computing
infrastructure also exhibits symptoms that
are too subtle for a human to notice but would easily be detectable via
machine learning?
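To make the idea of a "disappearing pattern" concrete, here is a minimal sketch that flags windows in which a metric varies too little, rather than too much. The function name, window size, threshold, and the request rates are all made up for illustration; nothing here reflects IBM's actual method.

```python
import statistics

def low_variation_windows(readings, window, min_stdev):
    """Return the start index of each non-overlapping window whose
    population standard deviation falls below min_stdev."""
    flagged = []
    for start in range(0, len(readings) - window + 1, window):
        chunk = readings[start:start + window]
        if statistics.pstdev(chunk) < min_stdev:
            flagged.append(start)
    return flagged

# Hypothetical request rates: a lively first half, then a suspiciously flat one.
rates = [100, 140, 90, 130, 110, 150, 120, 121, 120, 119, 120, 121]
flagged = low_variation_windows(rates, window=6, min_stdev=5.0)
print(flagged)
```

Note that no reading in the flat half would trip a conventional threshold; only the lack of variation stands out.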
We talked a bit about whether it was possible to alarm on the
first (and second) derivatives of some key parameters, and of course
it is. Doing so would require more sophistication than our current
monitoring systems have, but it's not too hard to imagine. But it
also misses the point. Once you know what to look for, it's
relatively easy to figure out how to detect it. IBM's insight wasn't
detecting the patterns that indicated a baby was about to become
sick, but using machine learning to figure out what the patterns
were. Can we do the same? It's not inconceivable, though it wouldn't
be easy.
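For what it's worth, the derivative alarms themselves are simple to sketch. The following assumes evenly spaced samples and uses finite differences as stand-ins for the first and second derivatives; the function names, limits, and queue depths are hypothetical.

```python
def differences(series):
    """Finite differences between consecutive samples."""
    return [b - a for a, b in zip(series, series[1:])]

def derivative_alarms(series, max_rate, max_accel):
    """Return (sample_index, kind, value) for every first or second
    difference whose magnitude exceeds its limit."""
    first = differences(series)
    second = differences(first)
    alarms = []
    for i, rate in enumerate(first):
        if abs(rate) > max_rate:
            alarms.append((i + 1, "rate", rate))
    for i, accel in enumerate(second):
        if abs(accel) > max_accel:
            alarms.append((i + 2, "accel", accel))
    return sorted(alarms)

# Hypothetical queue depths sampled at a fixed interval.
queue_depth = [10, 11, 12, 13, 20, 40, 41]
alarms = derivative_alarms(queue_depth, max_rate=15, max_accel=10)
print(alarms)
```

This only detects a pattern someone already chose to look for, which is exactly the limitation described above.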
Web operations has been on the forefront of "big data" since the
beginning. Long before we were talking about sentiment analysis or
recommendation engines, webmasters and system administrators were
analyzing problems by looking through gigabytes of server and system
logs, using tools that were primitive or non-existent.
MRTG and
HP's
OpenView were savage attempts to put together information dashboards
for IT groups. But at most enterprises, operations hasn't taken the
next step. Operations staff doesn't have the resources (neither
computational nor human) to apply machine intelligence to our
problems. We'd have to capture all the data coming off our
servers for extended periods, not just the server logs that we capture
now, but every kind of data we can collect: network data,
environmental data, I/O subsystem data, you name it. At a recent meetup
about finance, Abhi Mehta encouraged people to capture and save
"everything." He was talking about financial data, but the same
applies here. We'd need to build Hadoop clusters to
monitor our server farms; we'd need Hadoop clusters to monitor our
Hadoop clusters. It's a big investment of time and resources. If we
could make that investment, what would we find out? I bet that we'd be
surprised.
Related:
Operations is a competitive advantage...
Operations: The secret sauce revisited
Data
Web_Ops_&_Performance
dataanalysis
insight
operations
webops
Privacy, contexts and Girls Around Me
8 weeks ago by rahuldave
Last weekend, I read two excellent articles on the problems that
privacy presents in a mobile, digital age. The Atlantic presented a
summary of Helen Nissenbaum's thoughts on
privacy and social norms: When we discuss the use of online
privacy, we too often forget the social context in which data exists,
even when we're talking about social media. And Amit Runchal posted
a TechCrunch article about the Girls Around Me fiasco,
"Creating Victims and Blaming Them," where he points out that the
victims of a service like Girls Around Me shouldn't be blamed for not
understanding the arcane privacy settings of services like Facebook:
"But ... the women signed up to be a part of this when they signed up to be on Facebook. No. What they signed up for was to be on Facebook. Our identities change depending on our context, no matter what permissions we have given to the Big Blue Eye. Denying us the right to this creates victims who then get blamed for it. 'Well ... you shouldn't have been on Facebook if you didn't want to...' No. Please recognize them as a person. Please recognize what that means."
Runchal's powerful "no" underscores the problem: People sign up with
Facebook and Foursquare (which quickly blocked Girls Around Me's
access to their API) to communicate with friends, to play games, to
find former classmates, and so on. They don't sign up to have their
data sold to the highest bidder. And while Facebook and Foursquare have a
legitimate right to run a profitable business, their users have a legitimate right to be
treated with some respect, and it's hard to construe hundreds of
inscrutable privacy settings as "respect." Even if you understand the
settings, it's next to impossible to block apps that
you don't even know about. Perhaps the only way to protect yourself is
a complete retreat into privacy, which defeats the purpose of Facebook.
Runchal's article demonstrates the principles for which Nissenbaum is
arguing. Privacy and data don't exist in the abstract. Privacy and
data always exist in social contexts, and problems occur when data is
taken out of that context. Users give data to Facebook all the time;
that's normal, and the service couldn't exist without that happening.
Hundreds of millions of people use and enjoy Facebook, so the company is
clearly doing a lot of things right. However, handing that same data
to another application rips it out of context: Facebook data on its
own might be fine, Facebook data crossed with location data from
Foursquare is getting fishy (almost any use of location data quickly
becomes "fishy"), and that combination published via an app
that's designed to encourage stalking has crossed the line.
Nissenbaum has articulated the general principle; Runchal has provided
an excellent case study.
In a similar vein, Tim O'Reilly has argued that we should regulate the
use of data, and expect data collectors to obey cultural norms about
reasonable and unreasonable uses of data. A doctor could share your
medical history with researchers, but not with an insurance company
that might use it to cancel your policy. That's the only way to get
the medical progress that comes from sharing data without the
chilling side effect of making medical care inaccessible to anyone who
actually needs it. Tim has
defended Facebook for being willing to push the limits of privacy
because that's the only way to find out what the new norms should be
and what benefits we can derive from new applications. That's fair
enough, and in this case (as I already pointed out), Foursquare was
quick to yank API access.
It's useful to imagine the same software with a slightly different
configuration. Girls Around Me has undeniably crossed a line. But
what if, instead of finding women, the app was Hackers Around Me?
That might be borderline creepy, but most people could live with it,
and it might even lead to some wonderful impromptu hackathons. EMTs
Around Me could save lives. I doubt that you'd need to change a
single line of code to implement either of these apps,
just some search strings. The problem isn't the
software itself, nor is it the victims, but what happens when you move
data from one context into another. Moving data about EMTs into
a context where EMTs are needed is socially acceptable; moving data into
a context that facilitates stalking isn't acceptable, and shouldn't be.
The Atlantic's article about Nissenbaum ends with some pessimism about our
ability to define social norms surrounding privacy: "It's quite
difficult to figure out what the norms for a given situation might
be." And that's true. We don't yet know what cultural norms for
privacy are, let alone how to regulate for them, or how regulations
should evolve as technology evolves and cultural norms change. Locking
in our present norms
through some badly thought-out regulation strikes me as a recipe for
disaster. I care much more about the TSA's scanners at an airport than
about Google photographing my house for Street View, but I'd be
ecstatically surprised to see legislation that reflected my
priorities. The New York Times reports that
cell phone tracking is routinely used by local law enforcement
agencies, with little or no court oversight; and in the current
climate, I'd be surprised to see privacy regulation that challenges
the widespread use and abuse of surveillance by the police.
But this isn't the time to throw up our hands. It isn't as if we're
completely lacking in clue. With that in mind, I'll give Amit
Runchal the last word:
"The line is this: When you begin speaking for another person without
their permission, you are doing something wrong. When you create
another identity for them without their permission, you are doing
something wrong. When you make people feel victimized who previously
did not feel that way, you are doing something wrong."
Those are words I can live by.
Related:
My Contrarian Stance on Facebook and privacy
The privacy arc
The end of social
Web_2.0
data
privacy
social
socialapplications
userdata
Automated science, deep data and the paradox of information
8 weeks ago by rahuldave
A lot of great pieces have been written about the relatively recent surge in interest in big data and data science, but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, "What's different here? What's special about these outliers and what do they tell us about our models and assumptions?"
The reason that big data proponents are so excited about the burgeoning data revolution isn't just because of the math. Don't get me wrong, the math is fun, but we're excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.
That's big data.
Of course, data are just a collection of facts; bits of information that are only given context — assigned meaning and importance — by human minds. It's not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.
And therein lies the rub.
Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?
(Semi)Automated science
In 2010, Cornell researchers Michael Schmidt and Hod Lipson published a groundbreaking paper in "Science" titled, "Distilling Free-Form Natural Laws from Experimental Data". The premise was simple, and it essentially boiled down to the question, "can we algorithmically extract models to fit our data?"
So they hooked up a double pendulum — a seemingly chaotic system whose movements are governed by classical mechanics — and trained a machine learning algorithm on the motion data.
Their results were astounding.
In a matter of minutes the algorithm converged on Newton's second law of motion: f = ma. What took humanity tens of thousands of years to accomplish was completed on 32 cores in essentially no time at all.
In 2011, some neuroscience colleagues of mine, led by Tal Yarkoni, published a paper in "Nature Methods" titled "Large-scale automated synthesis of human functional neuroimaging data". In this paper the authors sought to extract patterns from the overwhelming flood of brain imaging research.
To do this they algorithmically extracted the 3D coordinates of significant brain activations from thousands of neuroimaging studies, along with words that frequently appeared in each study. Using these two pieces of data along with some simple (but clever) mathematical tools, they were able to create probabilistic maps of brain activation for any given term.
In other words, you type in a word such as "learning" on their website search and visualization tool, NeuroSynth, and they give you back a pattern of brain activity that you should expect to see during a learning task.
But that's not all. Given a pattern of brain activation, the system can perform a reverse inference, asking, "given the data that I'm observing, what is the most probable behavioral state that this brain is in?"
Similarly, in late 2010, my wife (Jessica Voytek) and I undertook a project to algorithmically discover associations between concepts in the peer-reviewed neuroscience literature. As a neuroscientist, the goal of my research is to understand relationships between the human brain, behavior, physiology, and disease. Unfortunately, the facts that tie all that information together are locked away in more than 21 million static peer-reviewed scientific publications.
How many undergrads would I need to hire to read through that many papers? Any volunteers?
Even more mind-boggling, each year more than 30,000 neuroscientists attend the annual Society for Neuroscience conference. If we assume that only two-thirds of those people actually do research, and if we assume that they only work a meager (for the sciences) 40 hours a week, that's around 40 million person-hours dedicated to but one branch of the sciences.
Annually.
This means that in the 10 years I've been attending that conference, more than 400 million person-hours have gone toward the pursuit of understanding the brain. Humanity built the pyramids in 30 years. The Apollo Project got us to the moon in about eight.
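The back-of-the-envelope figures above can be checked in a few lines. The attendance, research fraction, and weekly hours come from the text; the 50 working weeks per year is an assumption added here to reproduce the "around 40 million" estimate.

```python
attendees = 30_000
researchers = attendees * 2 // 3      # two-thirds actually do research
hours_per_week = 40
weeks_per_year = 50                   # assumption: not stated in the text

annual_person_hours = researchers * hours_per_week * weeks_per_year
decade_person_hours = annual_person_hours * 10

print(f"{annual_person_hours:,} person-hours per year")
print(f"{decade_person_hours:,} person-hours over 10 years")
```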
So my wife and I said to ourselves, "there has to be a better way".
Which led us to create brainSCANr, a simple (simplistic?) tool (currently itself under peer review) that makes the assumption that the more often that two concepts appear together in the titles or abstracts of published papers, the more likely they are to be associated with one another.
For example, if 10,000 papers mention "Alzheimer's disease" that also mention "dementia," then Alzheimer's disease is probably related to dementia. In fact, there are 17,087 papers that mention Alzheimer's and dementia, whereas there are only 14 papers that mention Alzheimer's and, for example, creativity.
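The co-occurrence assumption can be sketched roughly as follows. This is not brainSCANr's code: the function name, the naive substring matching, and the four toy titles are all hypothetical, chosen only to show the counting step.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(papers, terms):
    """Count, for every pair of terms, how many papers mention both.
    Matching is a naive case-insensitive substring test."""
    counts = Counter()
    for text in papers:
        lowered = text.lower()
        present = sorted(t for t in terms if t in lowered)
        counts.update(combinations(present, 2))
    return counts

# Hypothetical titles standing in for real titles and abstracts.
papers = [
    "Dementia progression in Alzheimer's disease",
    "Alzheimer's disease and dementia: a review",
    "Creativity and aging",
    "Creativity in patients with Alzheimer's disease",
]
terms = ["alzheimer", "dementia", "creativity"]
counts = cooccurrence_counts(papers, terms)
print(counts)
```

Pairs that share many papers are treated as more strongly associated than pairs that share few.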
From this, we built what we're calling the "cognome", a mapping between brain structure, function, and disease.
Big data, data mining, and machine learning are becoming critical tools in the modern scientific arsenal. Examples abound: text mining recipes to find cultural food taste preferences, analyzing cultural trends via word use in books ("culturomics"), identifying seasonality of mood from tweets, and so on.
But so what?
Deep data
What those three studies show us is that it's possible to automate, or at least semi-automate, critical aspects of the scientific method itself. Schmidt and Lipson show that it is possible to extract equations that perfectly model even seemingly chaotic systems. Yarkoni and colleagues show that it is possible to infer a complex behavioral state given input brain data.
My wife and I wanted to show that brainSCANr could be put to work for something more useful than just quantifying relationships between terms. So we created a simple algorithm to perform what we're calling "semi-automated hypothesis generation," which is predicated on a basic "the friend of a friend should be a friend" concept.
In the example below, the neurotransmitter "serotonin" has thousands of shared publications with "migraine," as well as with the brain region "striatum." However, migraine and striatum only share 16 publications.
That's very odd. Because in medicine there is a serotonin hypothesis for the root cause of migraines. And we (neuroscientists) know that serotonin is released in the striatum to modulate brain activity in that region. Given that those two things are true, why is there so little research regarding the role of the striatum in migraines?
Perhaps there's a missing connection?
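One way to picture the "friend of a friend" idea is the sketch below. It looks for two terms that are each strongly linked to a common third term but only weakly linked to each other. The function names and thresholds are hypothetical, and the two counts of 3000 are placeholders for "thousands"; only the 16 shared publications comes from the text.

```python
from itertools import combinations

def count(shared, a, b):
    """Shared-publication count for an unordered pair of terms."""
    return shared.get(frozenset((a, b)), 0)

def missing_links(shared, strong, weak):
    """Find pairs (a, c) with at most `weak` shared papers that both have
    at least `strong` shared papers with some bridging term b."""
    terms = sorted(set().union(*shared))
    found = []
    for a, c in combinations(terms, 2):
        if count(shared, a, c) > weak:
            continue
        bridges = [b for b in terms
                   if b not in (a, c)
                   and count(shared, a, b) >= strong
                   and count(shared, b, c) >= strong]
        if bridges:
            found.append((a, c, bridges))
    return found

shared = {
    frozenset(("serotonin", "migraine")): 3000,  # placeholder for "thousands"
    frozenset(("serotonin", "striatum")): 3000,  # placeholder for "thousands"
    frozenset(("migraine", "striatum")): 16,     # figure given in the text
}
links = missing_links(shared, strong=1000, weak=50)
print(links)
```

Each result is a candidate hypothesis to investigate, not a finding.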
Such missing links and other outliers in our models are the essence of deep data analytics. Sure, any data scientist worth their salt can take a mountain of data and reduce it down to a few simple plots. And such plots are important because they tell a story. But those aren't the only stories that our data can tell us.
For example, in my geoanalytics work as the data evangelist for Uber, I put some of my (definitely rudimentary) neuroscience network analytic skills to work to figure out how people move from neighborhood to neighborhood in San Francisco.
At one point, I checked to see if men and women moved around the city differently. A very simple regression model showed that the number of men who go to any given neighborhood significantly predicts the number of women who go to that same neighborhood.
No big deal.
But what's cool was seeing where the outliers were. When I looked at the model's residuals, that's where I found the far more interesting story. While it's good to have a model that fits your data, knowing where the model breaks down is not only important for internal metrics, but it also makes for a more interesting story:
What's happening in the Marina district that so many more women want to go there? And why are there so many more men in SoMa?
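Looking at residuals to find outliers might be sketched like this: fit an ordinary least-squares line, then see which points sit furthest from it. The neighborhood labels and trip counts below are invented for illustration and are not Uber data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals_by_name(names, xs, ys):
    """Map each name to its residual: observed y minus the fitted y."""
    slope, intercept = fit_line(xs, ys)
    return {name: y - (slope * x + intercept)
            for name, x, y in zip(names, xs, ys)}

# Hypothetical trips per neighborhood.
names = ["A", "B", "C", "D", "E"]
men = [100, 200, 300, 400, 500]
women = [90, 180, 400, 360, 450]

residuals = residuals_by_name(names, men, women)
most_unusual = max(residuals, key=lambda name: abs(residuals[name]))
print(most_unusual, round(residuals[most_unusual], 1))
```

A large positive residual means more women went there than the men-based fit predicts; a large negative one means fewer.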
The paradox of information
The interpretation of big data analytics can be a messy game. Maybe there are more men in SoMa because that's where AT&T Park is. But maybe there are just five guys who live in SoMa who happen to take Uber 100 times more often than average.
While data-driven posts make for fun reading (and writing), in the sciences we need to be more careful that we don't fall prey to ad hoc, just-so stories that sound perfectly reasonable and plausible, but which we cannot conclusively prove.
In 2008, psychologists David McCabe and Alan Castel published a paper in the journal "Cognition," titled, "Seeing is believing: The effect of brain images on judgments of scientific reasoning". In that paper, they showed that summaries of cognitive neuroscience findings that are accompanied by an image of a brain scan were rated as more credible by the readers.
This should cause any data scientist serious concern. In fact, I've formulated three laws of statistical analyses:
The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
Any sufficiently advanced statistics can trick people into believing the results reflect truth.
The first law is closely related to the "bike shed effect" (also known as Parkinson's Law of Triviality) which states that, "the time spent on any item of the agenda will be in inverse proportion to the sum involved."
In other words, if you try to build a simple thing such as a public bike shed, there will be endless town hall discussions wherein people argue over trivial details such as the color of the door. But if you want to build a nuclear power plant — a project so vast and complicated that most people can't understand it — people will defer to expert opinion.
Such is the case with statistics.
If[…]
Data
dataanalysis
dataconclusions
dataethics
datascience
scientificmethod
scientists
from google
The reason that big data proponents are so excited about the burgeoning data revolution isn't just because of the math. Don't get me wrong, the math is fun, but we're excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.
That's big data.
Of course, data are just a collection of facts; bits of information that are only given context — assigned meaning and importance — by human minds. It's not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.
And therein lies the rub.
Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?
(Semi)Automated science
In 2010, Cornell researchers Michael Schmidt and Hod Lipson published a groundbreaking paper in "Science" titled, "Distilling Free-Form Natural Laws from Experimental Data". The premise was simple, and it essentially boiled down to the question, "can we algorithmically extract models to fit our data?"
So they hooked up a double pendulum — a seemingly chaotic system whose movements are governed by classical mechanics — and trained a machine learning algorithm on the motion data.
Their results were astounding.
In a matter of minutes the algorithm converged on Newton's second law of motion: F = ma. What took humanity tens of thousands of years to accomplish was completed on 32 cores in essentially no time at all.
In 2011, some neuroscience colleagues of mine, led by Tal Yarkoni, published a paper in "Nature Methods" titled "Large-scale automated synthesis of human functional neuroimaging data". In this paper the authors sought to extract patterns from the overwhelming flood of brain imaging research.
To do this they algorithmically extracted the 3D coordinates of significant brain activations from thousands of neuroimaging studies, along with words that frequently appeared in each study. Using these two pieces of data along with some simple (but clever) mathematical tools, they were able to create probabilistic maps of brain activation for any given term.
In other words, you type in a word such as "learning" on their website search and visualization tool, NeuroSynth, and they give you back a pattern of brain activity that you should expect to see during a learning task.
But that's not all. Given a pattern of brain activation, the system can perform a reverse inference, asking, "given the data that I'm observing, what is the most probable behavioral state that this brain is in?"
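The reverse inference described here is, at heart, an application of Bayes' rule. The sketch below is only an illustration of that idea, not NeuroSynth's actual code: the function name, the three terms, and every probability are made up for the example.

```python
def reverse_inference(priors, likelihoods):
    """Return P(term | activation) for each term using Bayes' rule.

    priors:      P(term), how often each term shows up across studies
    likelihoods: P(activation | term), how often the observed pattern
                 appears in studies that use the term
    """
    # Unnormalized posterior: P(activation | term) * P(term)
    joint = {term: likelihoods[term] * priors[term] for term in priors}
    evidence = sum(joint.values())  # P(activation)
    return {term: value / evidence for term, value in joint.items()}


# Hypothetical numbers, for illustration only.
priors = {"learning": 0.5, "pain": 0.3, "language": 0.2}
likelihoods = {"learning": 0.2, "pain": 0.6, "language": 0.1}

posterior = reverse_inference(priors, likelihoods)
best_term = max(posterior, key=posterior.get)
print(best_term, round(posterior[best_term], 3))
```

Note that the most common term ("learning") is not the most probable explanation once the activation pattern is taken into account, which is the whole point of asking the question in reverse.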
Similarly, in late 2010, my wife (Jessica Voytek) and I undertook a project to algorithmically discover associations between concepts in the peer-reviewed neuroscience literature. As a neuroscientist, the goal of my research is to understand relationships between the human brain, behavior, physiology, and disease. Unfortunately, the facts that tie all that information together are locked away in more than 21 million static peer-reviewed scientific publications.
How many undergrads would I need to hire to read through that many papers? Any volunteers?
Even more mind-boggling, each year more than 30,000 neuroscientists attend the annual Society for Neuroscience conference. If we assume that only two-thirds of those people actually do research, and if we assume that they only work a meager (for the sciences) 40 hours a week, that's around 40 million person-hours dedicated to but one branch of the sciences.
Annually.
This means that in the 10 years I've been attending that conference, more than 400 million person-hours have gone toward the pursuit of understanding the brain. Humanity built the pyramids in 30 years. The Apollo Project got us to the moon in about eight.
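For anyone who wants to check the back-of-the-envelope math, here it is spelled out. The attendance, research share, and hours per week come from the text above; the 50 working weeks per year is my assumption to make the numbers land, since the text does not state it.

```python
attendees = 30_000      # annual conference attendance (from the text)
hours_per_week = 40     # "meager" work week (from the text)
weeks_per_year = 50     # assumption: not stated in the text

# Two-thirds of attendees do research (from the text); integer math keeps it exact.
researchers = attendees * 2 // 3

hours_per_year = researchers * hours_per_week * weeks_per_year
hours_per_decade = 10 * hours_per_year

print(f"{hours_per_year:,} person-hours per year")
print(f"{hours_per_decade:,} person-hours over 10 years")
```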
So my wife and I said to ourselves, "there has to be a better way".
Which led us to create brainSCANr, a simple (simplistic?) tool (currently itself under peer review) that makes the assumption that the more often that two concepts appear together in the titles or abstracts of published papers, the more likely they are to be associated with one another.
For example, if 10,000 papers mention "Alzheimer's disease" that also mention "dementia," then Alzheimer's disease is probably related to dementia. In fact, there are 17,087 papers that mention Alzheimer's and dementia, whereas there are only 14 papers that mention Alzheimer's and, for example, creativity.
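The counting idea can be sketched in a few lines. This is a toy illustration rather than the brainSCANr code: the four "abstracts" are invented, the function name is hypothetical, and plain substring matching stands in for whatever term matching a real tool would need.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, terms):
    """Count, for each pair of terms, how many abstracts mention both."""
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        # Naive substring match; sorted so each pair has one canonical order.
        present = sorted(t for t in terms if t in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Tiny hypothetical corpus standing in for millions of abstracts.
abstracts = [
    "Alzheimer's disease is the most common cause of dementia.",
    "Dementia risk factors in Alzheimer's disease patients.",
    "Creativity and divergent thinking in healthy adults.",
    "Preserved creativity in a patient with Alzheimer's disease.",
]
terms = ["alzheimer", "dementia", "creativity"]

counts = cooccurrence_counts(abstracts, terms)
for (a, b), n in counts.most_common():
    print(f"{a} + {b}: {n}")
```

Raw counts favor heavily studied terms, so a real analysis would probably normalize them by how often each term appears on its own.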
From this, we built what we're calling the "cognome", a mapping between brain structure, function, and disease.
Big data, data mining, and machine learning are becoming critical tools in the modern scientific arsenal. Examples abound: text mining recipes to find cultural food taste preferences, analyzing cultural trends via word use in books ("culturomics"), identifying seasonality of mood from tweets, and so on.
But so what?
Deep data
What those three studies show us is that it's possible to automate, or at least semi-automate, critical aspects of the scientific method itself. Schmidt and Lipson show that it is possible to extract equations that perfectly model even seemingly chaotic systems. Yarkoni and colleagues show that it is possible to infer a complex behavioral state given input brain data.
My wife and I wanted to show that brainSCANr could be put to work for something more useful than just quantifying relationships between terms. So we created a simple algorithm to perform what we're calling "semi-automated hypothesis generation," which is predicated on a basic "the friend of a friend should be a friend" concept.
In the example below, the neurotransmitter "serotonin" has thousands of shared publications with "migraine," as well as with the brain region "striatum." However, migraine and striatum only share 16 publications.
That's very odd. Because in medicine there is a serotonin hypothesis for the root cause of migraines. And we (neuroscientists) know that serotonin is released in the striatum to modulate brain activity in that region. Given that those two things are true, why is there so little research regarding the role of the striatum in migraines?
Perhaps there's a missing connection?
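One way to picture the "friend of a friend" search is as a scan for triples in which one term is strongly tied to two others that are barely tied to each other. The sketch below is an illustration under assumptions, not our actual algorithm: the function name and both thresholds are hypothetical, and the two "thousands" counts are placeholders (only the 16 comes from the text).

```python
def missing_links(shared, strong=1000, weak=50):
    """Find (hub, a, b, count) where hub is strongly linked to both a and b,
    but a and b share at most `weak` publications."""
    def n(x, y):
        return shared.get(frozenset((x, y)), 0)

    terms = sorted({t for pair in shared for t in pair})
    found = []
    for hub in terms:
        friends = [t for t in terms if t != hub and n(hub, t) >= strong]
        for i, a in enumerate(friends):
            for b in friends[i + 1:]:
                if n(a, b) <= weak:
                    found.append((hub, a, b, n(a, b)))
    return found


shared = {
    frozenset(("serotonin", "migraine")): 2500,  # placeholder for "thousands"
    frozenset(("serotonin", "striatum")): 3000,  # placeholder for "thousands"
    frozenset(("migraine", "striatum")): 16,     # from the text
}

for hub, a, b, count in missing_links(shared):
    print(f"{a} and {b} both relate to {hub} but share only {count} papers")
```

Each hit is a candidate hypothesis to check, not a finding; a weak link may simply mean the connection has already been ruled out.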
Such missing links and other outliers in our models are the essence of deep data analytics. Sure, any data scientist worth their salt can take a mountain of data and reduce it down to a few simple plots. And such plots are important because they tell a story. But those aren't the only stories that our data can tell us.
For example, in my geoanalytics work as the data evangelist for Uber, I put some of my (definitely rudimentary) neuroscience network analytic skills to work to figure out how people move from neighborhood to neighborhood in San Francisco.
At one point, I checked to see if men and women moved around the city differently. A very simple regression model showed that the number of men who go to any given neighborhood significantly predicts the number of women who go to that same neighborhood.
No big deal.
But what's cool was seeing where the outliers were. When I looked at the model's residuals, that's where I found the far more interesting story. While it's good to have a model that fits your data, knowing where the model breaks down is not only important for internal metrics, but it also makes for a more interesting story:
What's happening in the Marina district that so many more women want to go there? And why are there so many more men in SoMa?
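Here is roughly what "looking at the residuals" means in practice: fit a line, then rank neighborhoods by how far they fall from it. The trip counts below are invented for illustration (they are not Uber data), and the helper name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (men, women) trip counts per neighborhood.
trips = {
    "Richmond": (20, 20),
    "Castro": (60, 60),
    "Mission": (100, 100),
    "Downtown": (140, 140),
    "Marina": (70, 100),
    "SoMa": (90, 60),
}

men = [m for m, _ in trips.values()]
women = [w for _, w in trips.values()]
slope, intercept = fit_line(men, women)

# Residual = observed women minus what the model predicts from the men.
residuals = {
    name: w - (slope * m + intercept) for name, (m, w) in trips.items()
}
for name, r in sorted(residuals.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {r:+.1f}")
```

The neighborhoods at the two ends of the sorted list are the ones the model explains worst, and those are the ones worth asking questions about.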
The paradox of information
The interpretation of big data analytics can be a messy game. Maybe there are more men in SoMa because that's where AT&T Park is. But maybe there are just five guys who live in SoMa who happen to take Uber 100 times more often than average.
8 weeks ago by rahuldave
Visualization of the Week: The U.S. Wind Map
8 weeks ago by rahuldave
As Flowing Data's Nathan Yau remarked earlier this week:
"I get kind of giddy whenever I see a tweet from Martin Wattenberg and Fernanda Viegas. They rarely tweet, but when they do, it's usually because they've released a new project and they always announce it simultaneously."
Tweets this week from the two renowned data visualization artists unveiled their latest project: Wind Map. It's a beautiful visualization — Van Gogh-like with its swirling patterns — created using "close to live" data of the country's surface wind speed.
The trajectory of wind flowing across the U.S.
In addition to watching the patterns of the winds flow, you can click to zoom in and see more information about wind speed and direction.
Data for the map comes from the National Digital Forecast Database. As such, the data is currently U.S.-only, but Wattenberg and Viegas say they're interested in expanding the Wind Map. ("If you know of a source of detailed live wind data for the entire globe," they write, "please let us know.")
Found a great visualization? Tell us about it
This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.
More Visualizations:
Anachronistic language in "Mad Men"
Visualizing Big History
Kids Count in Washington, D.C.
Visualizing the Strata Conference
More Visualizations of the Week
Data
visualization
visualizationoftheweek
windmap
from google
8 weeks ago by rahuldave
Designing great data products
9 weeks ago by rahuldave
By Jeremy Howard, Margit Zwemer and Mike Loukides
Sections
Objective-based data products
The Model Assembly Line: A case study of Optimal Decisions Group
Drivetrain Approach to recommender systems
Optimizing lifetime customer value
Best practices from physical data products
The future for data products
In the past few years, we've seen many data products based on predictive modeling. These products range from weather forecasting to recommendation engines to services that predict airline flight times more accurately than the airline itself. But these products are still just making predictions, rather than asking what action they want someone to take as a result of a prediction. Prediction technology can be interesting and mathematically elegant, but we need to take the next step. The technology exists to build data products that can revolutionize entire industries. So, why aren't we building them?
To jump-start this process, we suggest a four-step approach that has already transformed the insurance industry. We call it the Drivetrain Approach, inspired by the emerging field of self-driving vehicles. Engineers start by defining a clear objective: They want a car to drive safely from point A to point B without human intervention. Great predictive modeling is an important part of the solution, but it no longer stands on its own; as products become more sophisticated, it disappears into the plumbing. Someone using Google's self-driving car is completely unaware of the hundreds (if not thousands) of models and the petabytes of data that make it work. But as data scientists build increasingly sophisticated products, they need a systematic design approach. We don't claim that the Drivetrain Approach is the best or only method; our goal is to start a dialog within the data science and business communities to advance our collective vision.
Objective-based data products
We are entering the era of data as drivetrain, where we use data not just to generate more data (in the form of predictions), but use data to produce actionable outcomes. That is the goal of the Drivetrain Approach. The best way to illustrate this process is with a familiar data product: search engines. Back in 1997, AltaVista was king of the algorithmic search world. While their models were good at finding relevant websites, the answer the user was most interested in was often buried on page 100 of the search results. Then, Google came along and transformed online search by beginning with a simple question: What is the user's main objective in typing in a search query?
The four steps in the Drivetrain Approach.
Google realized that the objective was to show the most relevant search result; for other companies, it might be increasing profit, improving the customer experience, finding the best path for a robot, or balancing the load in a data center. Once we have specified the goal, the second step is to specify what inputs of the system we can control, the levers we can pull to influence the final outcome. In Google's case, they could control the ranking of the search results. The third step was to consider what new data they would need to produce such a ranking; they realized that the implicit information regarding which pages linked to which other pages could be used for this purpose. Only after these first three steps do we begin thinking about building the predictive models. Our objective and available levers, what data we already have and what additional data we will need to collect, determine the models we can build. The models will take both the levers and any uncontrollable variables as their inputs; the outputs from the models can be combined to predict the final state for our objective.
Step 4 of the Drivetrain Approach for Google is now part of tech history: Larry Page and Sergey Brin invented the graph traversal algorithm PageRank and built an engine on top of it that revolutionized search. But you don't have to invent the next PageRank to build a great data product. We will show a systematic approach to step 4 that doesn't require a PhD in computer science.
The Model Assembly Line: A case study of Optimal Decisions Group
Optimizing for an actionable outcome over the right predictive models can be a company's most important strategic decision. For an insurance company, policy price is the product, so an optimal pricing model is to them what the assembly line is to automobile manufacturing. Insurers have centuries of experience in prediction, but as recently as 10 years ago, the insurance companies often failed to make optimal business decisions about what price to charge each new customer. Their actuaries could build models to predict a customer's likelihood of being in an accident and the expected value of claims. But those models did not solve the pricing problem, so the insurance companies would set a price based on a combination of guesswork and market studies.
This situation changed in 1999 with a company called Optimal Decisions Group (ODG). ODG approached this problem with an early use of the Drivetrain Approach and a practical take on step 4 that can be applied to a wide range of problems. They began by defining the objective that the insurance company was trying to achieve: setting a price that maximizes the net-present value of the profit from a new customer over a multi-year time horizon, subject to certain constraints such as maintaining market share. From there, they developed an optimized pricing process that added hundreds of millions of dollars to the insurers' bottom lines. [Note: Co-author Jeremy Howard founded ODG.]
ODG identified which levers the insurance company could control: what price to charge each customer, what types of accidents to cover, how much to spend on marketing and customer service, and how to react to their competitors' pricing decisions. They also considered inputs outside of their control, like competitors' strategies, macroeconomic conditions, natural disasters, and customer "stickiness." They considered what additional data they would need to predict a customer's reaction to changes in price. It was necessary to build this dataset by randomly changing the prices of hundreds of thousands of policies over many months. While the insurers were reluctant to conduct these experiments on real customers, as they'd certainly lose some customers as a result, they were swayed by the huge gains that optimized policy pricing might deliver. Finally, ODG started to design the models that could be used to optimize the insurer's profit.
Drivetrain Step 4: The Model Assembly Line. Picture a Model Assembly Line for data products that transforms the raw data into an actionable outcome. The Modeler takes the raw data and converts it into slightly more refined predicted data.
The first component of ODG's Modeler was a model of price elasticity (the probability that a customer will accept a given price) for new policies and for renewals. The price elasticity model is a curve of price versus the probability of the customer accepting the policy conditional on that price. This curve moves from almost certain acceptance at very low prices to almost never at high prices.
The second component of ODG's Modeler related price to the insurance company's profit, conditional on the customer accepting this price. The profit for a very low price will be in the red by the value of expected claims in the first year, plus any overhead for acquiring and servicing the new customer. Multiplying these two curves creates a final curve that shows price versus expected profit (see Expected Profit figure, below). The final curve has a clearly identifiable local maximum that represents the best price to charge a customer for the first year.
Expected profit.
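To make the two-curve multiplication concrete, here is a minimal sketch. It is not ODG's model: the logistic shape of the acceptance curve, the function names, and every number (midpoint, steepness, claims, overhead, price grid) are assumptions chosen only to show the mechanics.

```python
import math


def acceptance_probability(price, midpoint=600.0, steepness=0.01):
    """Hypothetical price-elasticity curve: close to 1 at very low prices,
    close to 0 at very high prices."""
    return 1.0 / (1.0 + math.exp(steepness * (price - midpoint)))


def profit_if_accepted(price, expected_claims=350.0, overhead=100.0):
    """First-year profit, conditional on the customer accepting the price."""
    return price - expected_claims - overhead


def expected_profit(price):
    """Product of the two curves: P(accept | price) * profit given acceptance."""
    return acceptance_probability(price) * profit_if_accepted(price)


# Scan a grid of candidate prices and keep the one with the highest value.
prices = range(300, 1001, 10)
best_price = max(prices, key=expected_profit)
print(best_price, round(expected_profit(best_price), 2))
```

At the low end the product is negative because the profit curve is in the red; at the high end it shrinks toward zero because almost nobody accepts. The peak sits in between.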
ODG also built models for customer retention. These models predicted whether customers would renew their policies in one year, allowing for changes in price and willingness to jump to a competitor. These additional models allow the annual models to be combined to predict profit from a new customer over the next five years.
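A simple way to see how retention turns annual models into a multi-year figure is to weight each year's profit by the chance the customer is still around, then discount it. This is a sketch under stated assumptions, not ODG's method: the function name, the constant renewal probability, the discount rate, and the profit figures are all hypothetical.

```python
def expected_npv(annual_profits, renewal_probability, discount_rate=0.05):
    """Expected discounted profit over the years in `annual_profits`.

    Assumes the same renewal probability every year, which a real
    retention model would replace with a per-year, per-price prediction.
    """
    total = 0.0
    still_customer = 1.0  # probability the policy is still in force
    for year, profit in enumerate(annual_profits):
        total += still_customer * profit / (1 + discount_rate) ** year
        still_customer *= renewal_probability
    return total


# Hypothetical five-year profile: a loss in year one, then steady profit.
profits = [-50.0, 150.0, 150.0, 150.0, 150.0]
value = expected_npv(profits, renewal_probability=0.8)
print(round(value, 2))
```

Seen this way, a price that loses money in the first year can still be the right one if it keeps the customer renewing.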
This new suite of models is not a final answer because it only identifies the outcome for a given set of inputs. The next machine on the assembly line is a Simulator, which lets ODG ask the "what if" questions to see how the levers affect the distribution of the final outcome. The expected profit curve is just a slice of the surface of possible outcomes. To build that entire surface, the Simulator runs the models over a wide range of inputs. The operator can adjust the input levers to answer specific questions like, "What will happen if our company offers the customer a low teaser price in year one but then raises the premiums in year two?" They can also explore how the distribution of profit is shaped by the inputs outside of the insurer's control: "What if the economy crashes and the customer loses his job? What if a 100-year flood hits his home? If a new competitor enters the market and our company does not react, what will be the impact on our bottom line?" Because the simulation is at a per-policy level, the insurer can view the impact of a given set of price changes on revenue, market share, and other metrics over time.
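A toy version of such a simulator might look like the sketch below: it runs a small model over every combination of year-one price, year-two price, and scenario, and stores the resulting surface. Everything here is an assumption made for illustration, including the logistic acceptance curve, the scenario names and parameters, and the two-year horizon; none of it is ODG's implementation.

```python
import math
from itertools import product


def accept_probability(price, midpoint):
    """Hypothetical curve: chance a customer accepts (or renews at) a price."""
    return 1.0 / (1.0 + math.exp(0.01 * (price - midpoint)))


def simulate(year1_price, year2_price, scenario):
    """Expected two-year profit for one policy under one scenario."""
    p_accept = accept_probability(year1_price, scenario["midpoint"])
    p_renew = accept_probability(year2_price, scenario["midpoint"])
    year1 = year1_price - scenario["claims"]
    year2 = year2_price - scenario["claims"]
    return p_accept * (year1 + p_renew * year2)


# Inputs outside the insurer's control, expressed as hypothetical scenarios.
scenarios = {
    "baseline": {"midpoint": 600.0, "claims": 400.0},
    "recession": {"midpoint": 500.0, "claims": 400.0},  # customers more price-sensitive
    "flood": {"midpoint": 600.0, "claims": 700.0},      # much higher claims
}
prices = range(300, 901, 50)

# The surface of outcomes: one value per (scenario, year-1 price, year-2 price).
surface = {
    (name, p1, p2): simulate(p1, p2, s)
    for name, s in scenarios.items()
    for p1, p2 in product(prices, prices)
}


def best_prices(name):
    """Price pair with the highest expected profit in one scenario."""
    pairs = [(p1, p2) for (n, p1, p2) in surface if n == name]
    return max(pairs, key=lambda pair: surface[(name, *pair)])


# "What if we offer a teaser price in year one and raise it in year two?"
print("teaser:", round(surface[("baseline", 400, 700)], 2))
print("flat:  ", round(surface[("baseline", 600, 600)], 2))
for name in scenarios:
    print(name, best_prices(name))
```

Because the whole surface is kept rather than a single best answer, the same run also shows which price pairs turn ugly under the flood or recession scenarios.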
The Simulator's result is fed to an Optimizer, which takes the surface of possible outcomes and identifies the highest point. The Optimizer not only finds the best outcomes, it can also identify catastrophic outcomes and show how to avoid them. There are many different optimization techniques to choose from (see sidebar, below), but it is a well-understood field with robust and accessible solutions. ODG's competitors use different techniques to find an optimal price, but they are shipping the same overall data product. What matters is that using a Drivetrain Approach combined with a Model Assembly Line bridges the g[…]
Data
datadesign
dataprocess
dataproduct
datascience
drivetrainapproach
from google
9 weeks ago by rahuldave
Profile of the Data Journalist: The Storyteller and The Teacher
12 weeks ago by rahuldave
Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.
To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.
Sarah Cohen (@sarahduke), the Knight professor of the practice of journalism and public policy at Duke University, and Anthony DeBarros (@AnthonyDB), the senior database editor at USA Today, were both important sources of historical perspective for my feature on how data journalism is evolving from "computer-assisted reporting" (CAR) to a powerful Web-enabled practice that uses cloud computing, machine learning and algorithms to make sense of unstructured data.
The latter halves of our interviews, which focused upon their personal and professional experience, follow.
What data journalism project are you the most proud of working on or creating?
DeBarros: "In 2006, my USA TODAY colleague Robert Davis and I built a database of 620 students killed on or near college campuses and mined it to show how freshmen were uniquely vulnerable. It was a heart-breaking but vitally important story to tell. We won the 2007 Missouri Lifestyle Journalism Awards for the piece, and followed it with an equally wrenching look at student deaths from fires."
Cohen: "I'd have to say the Pulitzer-winning series on child deaths in DC, in which we documented that children were dying in predictable circumstances after key mistakes by people who knew that their
agencies had specific flaws that could let them fall through the cracks.
I liked working on the Post's POTUS Tracker and Head Count. Those were Web projects that were geared at accumulating lots of little bits about Obama's schedule and his appointees, respectively, that we could share with our readers while simultaneously building an important dataset for use down the road. Some of the Post's Solyndra and related stories, I have heard, came partly from studying the president's trips in POTUS Tracker.
There was one story, called "Misplaced Trust," on DC's guardianship
system, that created immediate change in Superior Court, which was
gratifying. "Harvesting Cash," our 18-month project on farm subsidies, also helped point out important problems in that system.
The last one, I'll note, is a piece of a project I worked on,
in which the DC water authority refused to release the results of a
massive lead testing effort, which in turn had shown widespread
contamination. We got the survey from a source, but it was on paper.
After scanning, parsing, and geocoding, we sent out a team of reporters to
neighborhoods to spot check the data, and also do some reporting on the
neighborhoods. We ended up with a story about people who didn't know what
was near them.
We also had an interesting experience: the water
authority called our editor to complain that we were going to put all of
the addresses online -- they felt that it was violating people's privacy,
even though we weren't identifying the owners or the residents. It was more
important to them that we keep people in the dark about their blocks. Our
editor at the time, Len Downie, said, "You're right. We shouldn't just put
it on the Web." He also ordered up a special section to put them all in
print."
Where do you turn to keep your skills updated or learn new things?
Cohen: "It's actually a little harder now that I'm out of the newsroom,
surprisingly. Before, I would just dive into learning something when I'd
heard it was possible and I wanted to use it to get to a story. Now I'm
less driven, and I have to force myself a little more. I'm hoping to start
doing more reporting again soon, and that the Reporters' Lab will help
there too.
Lately, I've been spending more time with people from other
disciplines to understand better what's possible, like machine learning
and speech recognition at Carnegie Mellon and MIT, or natural language
processing at Stanford. I can't DO them, but getting a chance to
understand what's out there is useful. NewsFoo, SparkCamp and NICAR are
the three places that had the best bang this year. I wish I could have
gone to Strata, even if I didn't understand it all."
DeBarros: "For surveillance, I follow really smart people on Twitter and have several key Google Reader subscriptions.
To learn, I spend a lot of time training after work hours. I've really been pushing myself in the last couple of years to up my game and stay relevant, particularly by learning Python, Linux and web development. Then I bring it back to the office and use it for web scraping and app building."
Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?
Cohen: "I think anything that gets more leverage out of fewer people is
important in this age, because fewer people are working full time holding
government accountable. The news apps help get more eyes on what the
government is doing by getting more of what we work with and let them see
it. I also think it helps with credibility -- the 'show your work' ethos --
because it forces newsrooms to be more transparent with readers / viewers.
For instance, now, when I'm judging an investigative prize, I am quite
suspicious of any project that doesn't let you see each item, i.e., when
they say, 'there were 300 cases that followed this pattern,' I want to see
all 300 cases, or all cases with the 300 marked, so I can see whether I
agree."
DeBarros: "They're important because we're living in a data-driven culture. A data-savvy journalist can use the Twitter API or a spreadsheet to find news as readily as he or she can use the telephone to call a source. Not only that, we serve many readers who are accustomed to dealing with data every day -- accountants, educators, researchers, marketers. If we're going to capture their attention, we need to speak the language of data with authority. And they are smart enough to know whether we've done our research correctly or not.
As for news apps, they're important because -- when done right -- they can make large amounts of data easily understood and relevant to each person using them."
These interviews were edited and condensed for clarity.
Data
Gov_2.0
Publishing
datajournalism
dataproduct
datascience
datavisualization
narrative
nicarinterview
Profile of the Data Journalist: The Hacks Hacker
12 weeks ago by rahuldave
Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.
To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference. This interview followed the conference and featured a remote participant who diligently used social media and the World Wide Web to document and share the best of NICAR:
The #nicar12 tips, tools & links page has new mapping, info design & election presos & references: j.mp/y5XD7f /cc @kn0wtheory
— Chrys Wu (@MacDiva) February 24, 2012
Chrys Wu (@MacDiva) is a data journalist and user engagement strategist based in New York City. Our interview follows.
Where do you work now? What is a day in your life like?
I work with clients through my company, Matchstrike, which specializes in user engagement strategy. It's a combination of user experience research, design and program planning. Businesses turn to me to figure out how to keep people's attention, create community and tie that back to return on investment.
I also launch Hacks/Hackers chapters around the world and co-organize the group in New York with Al Shaw of ProPublica and Jacqui Cox of The New York Times.
Both things involve seeking out people and ideas, asking questions, reading, wireframing and understanding what motivates people as individuals and as groups.
How did you get started in data journalism? Did you get any special degrees or certificates?
I had a stats class in high school with a really terrific instructor who also happened to be the varsity basketball coach. He was kind of like our John Wooden. Realizing the importance of statistics, being able to organize and interpret data, and learning how to be skeptical of claims (e.g., where "4 out of 5 dentists agree" comes from) has always stayed with me.
Other than that class and studying journalism at university, what I know has come from exploring (finding what's out there), doing (making something) and working (making something for money). I think that's pretty similar to most journalists and journalist-developers currently in the field.
Though I've spent several years in newsrooms (most notably with the Los Angeles Times and CBS Digital Media Group), most of my journalism and communications career has been as a freelancer. One of my earliest clients specialized in fundraising for Skid Row shelters. I quantified the need cases for her proposals. That involved working closely with the city health and child welfare departments and digging through a lot of data.
Once I figured that out, it was important to balance the data with narrative. Numbers and charts have a much more profound impact on people if they're framed by an idea to latch onto and a compelling story to share.
Did you have any mentors? Who? What were the most important resources they shared with you?
I don't have individual mentors, but there's an active community with a huge body of work out there to learn from. It's one of the reasons why I've been collecting things on Delicious and Pinboard, and it's why I try my best to put everything that's taught at NICAR on my blog.
I always try to look beyond journalism to see what people are thinking about and doing in other fields. Great ideas can come from everywhere. There are lots of very smart people willing to share what they know.
What does your personal data journalism "stack" look like? What tools could you not live without?
I use Coda and TextMate most often. For wireframing, I'm a big fan of OmniGraffle. I code in Ruby, and a little bit in Python. I'm starting to learn how to use R for dataset manipulation and for its maps library.
For keeping tabs on new but not urgent-to-read material, I use my friend Samuel Clay's RSS reader, NewsBlur.
What data journalism project are you the most proud of working on or creating?
I'm most proud of working with the Hacks/Hackers community. Since 2009, we've grown to more than 40 groups worldwide, with each locality bringing journalists, designers and developers together to push what's possible for news.
As I say, talking is good; making is better — and the individual Hacks/Hackers chapters have all done some version of that: presentations, demos, classes and hack days. They're all opportunities to share knowledge, make friends and create new things that help people better understand what's happening around them.
Where do you turn to keep your skills updated or learn new things?
MIT's open courses have been great. There are also blogs, mailing lists, meetups, lectures and conferences. And then there's talking with friends and people they know.
Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?
I like Amanda Cox's view of the importance of reporting through data. She's a New York Times graphics editor who comes from a statistics background. To paraphrase: Presenting a pile of facts and numbers without directing people toward any avenue of understanding is not useful.
Journalism is fundamentally about fact-finding and opening eyes. One of the best ways to do that, especially when lots of people are affected by something, is to interweave narrative with quantifiable information.
Data journalism and news apps create the lens that shows people the big picture they couldn't see but maybe had a hunch about otherwise. That's important for a greater understanding of the things that matter to us as individuals and as a society.
This interview has been edited and condensed for clarity.
Data
Gov_2.0
Publishing
civicapps
civichacker
datajouralism
datascience
datavisualization
nicarinterview
Profile of the Data Journalist: The Data Editor
12 weeks ago by rahuldave
Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.
To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.
Meghan Hoyer (@MeghanHoyer) is a data editor based in Virginia. Our interview follows.
Where do you work now? What is a day in your life like?
I work in an office within The Virginian-Pilot’s newsroom. I’m a one-person team, so there’s no such thing as typical.
What I might do: Help a reporter pull Census data, work with IT on improving our online crime report app, create a DataTable of city property assessment changes, and plan training for a group of co-workers who’d like to grow their online skills. At least, that’s what I’m doing today.
Tomorrow, it’ll be helping with our online election report, planning a strategy to clean a dirty database, and working with a reporter to crunch data for a crime trend story.
How did you get started in data journalism? Did you get any special degrees or certificates?
I have a journalism degree from Northwestern, but I got started the same way most reporters probably got started -- I had questions about my community and I wanted quantifiable answers. How had the voting population in a booming suburb changed? Who was the region’s worst landlord? Were our localities going after delinquent taxpayers? Anecdotes are nice, but it’s an amazingly powerful thing to be able to get the true measure of a situation. Numbers and analysis help provide a better focus -- and sometimes, they entirely upend your initial theory.
Did you have any mentors? Who? What were the most important resources they shared with you?
I haven’t collected a singular mentor as much as a group of people whose work I keep tabs on, for inspiration and follow-up. The news community is pretty small. A lot of people have offered suggestions, guidance, cheat sheets and help over the years. Data journalism -- from analysis to building apps -- is definitely not something you can or need to learn in a bubble all on your own.
What does your personal data journalism "stack" look like? What tools could you not live without?
In terms of daily tools, I keep it basic: Google Docs, Fusion Tables and Refine, QGIS, SQLite and Excel are all in use pretty much every day.
I’ve learned some Python and JavaScript for specific projects and to automate some of the newsroom’s daily tasks, but I definitely don’t have the programming or technical background that a lot of people in this field have. That’s left me trying to learn as much as I can as quickly as I can.
In terms of a data stack, we keep information such as public employee salaries, land assessment databases and court record databases (among others) updated in a shared drive in our newsroom. It’s amazing how often reporters use them, even if it’s just to find out which properties a candidate owns or how long a police officer caught at a DUI checkpoint has been on the force.
What data journalism project are you the most proud of working on or creating?
I'm proud of using regional records to do an analysis which forced Norfolk to revamp its whole real estate tax collection process.
A few years ago, I combined property ownership records, code enforcement citations, real estate tax records and rental inspection information from all our local cities into "Cashing Blight" -- and found a company with hundreds of derelict properties.
Their properties seemed to change hands often, so a partner and I then hand-built a database from thousands of land deeds that proved the company was flipping houses among investors in a $26 million mortgage fraud scheme. None of the cities in our region had any idea this was going on because they were dealing with each parcel as a separate entity.
That’s what combining sets of data can get you - a better overall view of what’s really happening. While government agencies are great at collecting piles of data, it’s that kind of larger analysis that’s missing.
Where do you turn to keep your skills updated or learn new things?
To be honest -- Twitter. I get a lot of ideas and updates on new tools there. And the NICAR conference and listserv. Usually when you hit up against a problem -- whether it’s dealing with a dirty dataset or figuring out how to best visualize your data -- it’s something that someone else has already faced.
I also learn a lot from the people within our newsroom. We have a talented group of web producers who all are eager to try new things and learn.
Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?
Data is everywhere, but in most cases it’s just stockpiled and warehoused without a second thought to analysis or using it to solve larger problems.
Journalists are in a unique position to make sense of it, to find the stories in it, to make sure that governments and organizations are considering the larger picture.
I think, too, that people in our field need to truly push for open government in the sense not of government building interfaces for data, but for just releasing raw data streams. Government is still far too stuck in the “Here’s a PDF of a spreadsheet” mentality. That doesn’t create informed citizens, and it doesn’t lead to innovative ways of thinking about government.
I’ve been involved recently in a community effort to create an API and then apps out of the regional transit authority’s live bus GPS stream. It has been a really fun project - and one I hope makes local governments rethink their practices.
Data
Gov_2.0
Publishing
civichacker
datajouralism
dataproduct
datascience
nicarinterview
from google
12 weeks ago by rahuldave
Profile of the Data Journalist: The Daily Visualizer
12 weeks ago by rahuldave
Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.
To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.
Matt Stiles (@Stiles), a data journalist based in Washington, D.C., maintains a popular Daily Visualization blog. Our interview follows.
Where do you work now? What is a day in your life like?
I work at NPR, where I oversee data journalism on the State Impact project, a local-national partnership between us and member stations. My typical day always begins with a morning "scrum" meeting among the D.C. team as part of our agile development process. I spend time acquiring and analyzing data throughout each day, and I typically work directly with reporters, training them on software and data visualization techniques. I also spend time planning news apps and interactives, a process that requires close consultation with reporters, designers and developers.
How did you get started in data journalism? Did you get any special degrees or certificates?
No special training or certificates, though I did attend three NICAR boot camps (databases, mapping, statistics) over the years.
Did you have any mentors? Who? What were the most important resources they shared with you?
I have several mentors, both on the reporting side and the data side. For data, I wouldn't be where I am today without the help of two people: Chase Davis and Jennifer LaFleur. Jen got me interested early, and has helped me with formal and informal training over the years. Chase helped me with day-to-day questions when we worked together at the Houston Chronicle.
What does your personal data journalism "stack" look like? What tools could you not live without?
I have a MacBook that runs Windows 7. I have the basic CAR suite (Excel/Access, ArcGIS, SPSS, etc.) but also plenty of open-source tools, such as R for visualization or MySQL/Postgres for databases. I use Coda and TextMate for coding. I use BBEdit and Python for text manipulation. I also couldn't live without Photoshop and Illustrator for cleaning up graphics.
What data journalism project are you the most proud of working on or creating?
I'm most proud of the online data library I created (and others have since expanded) at The Texas Tribune, but we're building some sweet apps at NPR. That's only going to expand now that we've created a national news apps team, which I'm joining soon.
Where do you turn to keep your skills updated or learn new things?
I read blogs, subscribe to email lists and attend lots of conferences for inspiration. There's no silver bullet. If you love this stuff, you'll keep up.
Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?
More and more information is coming at us every day. The deluge is so vast. Data journalism at its core is important because it's about facts, not anecdotes.
Apps are important because Americans are already savvy data consumers, even if they don't know it. We must get them thinking -- or, even better, not thinking -- about news consumption in the same way they think about syncing their iPads or booking flights on Priceline or purchasing items on eBay. These are all "apps" that are familiar to many people. Interactive news should be, too.
This interview has been edited and condensed for clarity.
Data
Gov_2.0
Publishing
datajournalism
dataproduct
datascience
datavisualization
nicarinterview
npr
opendata
from google
12 weeks ago by rahuldave
Profile of the Data Journalist: The Visualizer
march 2012 by rahuldave
Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.
To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.
Michelle Minkoff (@MichelleMinkoff) is an investigative developer/journalist based in Washington, D.C. Our interview follows.
Where do you work now? What is a day in your life like?
I am an Interactive Producer at the Associated Press' Washington DC bureau, where I focus on news applications related to politics and the election, as well as general mapping for our interactives on the Web. While my days pretty much always involve sitting in front of a computer, the actual tasks themselves can vary wildly. I may be chatting with reporters and editors in politics, environment, education, national security or myriad other beats about upcoming stories and how to use data to support reporting or create interactive stories. I might be gathering data, reformatting it or crafting Web applications. I spend a great deal of time creating interactive mapping systems, working a lot with geographic data, and collaborating with cartographers, editors and designers to decide how to best display it.
I split my time between working closely with my colleagues in the Washington bureau on the reporting/editing side, and my fellow interactive team members, only one of whom is also in DC. Our team is global, headquartered in New York, but with members spanning the globe from Phoenix to Bangkok.
It's a question of walking a balance between what needs to be done on daily deadlines for breaking news, longer-term stories which are often investigative, and creating frameworks that help The Associated Press to make the most of the Web's interactive nature in the long run.
How did you get started in data journalism? Did you get any special degrees or certificates?
I caught the bug when I took a computer-assisted reporting class from Derek Willis, a member of the New York Times' Interactive News Team, at Northwestern's journalism school where I was a grad student. I was fascinated by the role that technology could play in journalism for reporting and presentation, and very quickly got hooked. I also quickly discovered that I could lose track of hours playing with these tools, and that what came naturally to me was not as natural to others. I would spend days reporting for class, on and off Capitol Hill, and nights exchanging gchats with Derek and other data journalists he introduced me to. I started to understand SQL, advanced Excel, and fairly quickly thereafter, Python and Django.
I followed this up with an independent study in data visualization back at Medill's Chicago campus, under Rich Gordon. I practiced making Django apps, played with the Processing visualization language. I voraciously read through all the Tufte books. As a final project, I created a package about the persistence of Chicago art galleries that encompasses text, Flash visualization and a searchable database.
I have a concentration in Interactive Journalism with my Medill master's degree, but the courses mentioned above are only a partial component of that concentration.
Did you have any mentors? Who? What were the most important resources they shared with you?
The question here is in the wrong tense. I currently "do" have many mentors, and I don't know how I would do my job without what they've shared in the past, and in the present. Derek, mentioned above, was the first. He introduced me to his friend Matt [Waite], and then he told me there was a whole group of people doing this work at NICAR. Literally hundreds of people from that organization have helped me at various places on my journey, and I believe strongly in the mantra of "paying it forward" as they have -- no one can know it all, so we pass on what we've learned, so more people can do even better work.
Other key folks I've had the privilege to work with include all of the Los Angeles Times' Data Desk's members, which includes reporters, editors and Web developers. I worked most closely with Ben Welsh and Ken Schwencke, who answered many questions, and were extremely encouraging when I was at the very beginning of my journey.
At my current job at The Associated Press, I'm lucky to have teammates who mentor me in design, mapping and various Washington-based beats. Each is helpful in his or her own way.
Special attention deserves to be called to Jonathan Stray, who's my official boss, but also a fantastic mentor who enables me to do what I do. He's helping me to learn the appropriate technical skills to execute what I see in my head, as well as learn how to learn. He's not just teaching me the answers to the problems we encounter in our daily work, but also helping me learn how to better solve them, and work this whole "thing I do" into a sustainable career path. And all with more patience than I have for myself.
What does your personal data journalism "stack" look like? What tools could you not live without?
No matter how advanced our tools get, I always find myself coming back to Excel first to do simple work. It helps us get an overall handle on a data set. I also will often quickly bring data into SQLite Manager, a Firefox extension that allows a user to run SQL queries, with no database setup. I'm more comfortable asking complicated questions of data that way. I also like to use Google's Chart Tools to create quick visualizations for myself to better understand a story.
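For readers who want to try the "SQL queries with no database setup" workflow described above, here is a minimal sketch using Python's built-in `sqlite3` and `csv` modules rather than the Firefox extension. It is not the workflow from the interview: the table name, column names, city names and sample figures are all hypothetical, and the `load_csv` helper exists only for this illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real spreadsheet export.
SAMPLE_CSV = """city,year,assessed_value
Springfield,2011,180000
Springfield,2012,175000
Riverton,2011,210000
Riverton,2012,214000
"""


def load_csv(conn, table, text):
    """Create `table` from CSV text and return the number of rows loaded."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys())
    col_defs = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )
    return len(rows)


# An in-memory database needs no server and leaves no file behind.
conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, "assessments", SAMPLE_CSV)

# CSV values arrive as text, so cast before averaging.
query = """
SELECT city, AVG(CAST(assessed_value AS REAL)) AS avg_value
FROM assessments
GROUP BY city
ORDER BY avg_value DESC
"""
results = conn.execute(query).fetchall()
for city, avg_value in results:
    print(city, avg_value)
```

Swapping `":memory:"` for a file path would keep the table around between sessions, which may suit data that gets queried repeatedly.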
When it comes to presentation, since I've been doing a lot with mapping recently, I don't know what I'd do without my favorite open source tools, Tilemill and Leaflet. Building a map stack is hard work, but the work that others have done before has made it a lot easier.
If we consider programming languages tools (which I do), JavaScript is my new Swiss army knife. Prior to coming to the AP, I did a lot with Python and Django, but I've learned a lot about what I like to call "Really Hard JavaScript." It's not just about manipulating the colors of a background on a Web page, but parsing, analyzing and presenting data. When I need to do more complex work to manipulate data, I use a combination of Ruby and Python -- depending on which has better tools for the job. For XML parsing, I like Ruby more. For simplifying geo data, I prefer Python.
What data journalism project are you the most proud of working on or creating?
That would be "Road to 270", a project we did at the AP that allows users to test out hypothetical "what-if" scenarios for the national election, painting states to define to which candidate a state's delegates could go. It combines demographic and past election data with the ability for users to make a choice and deeply engage with the interactive. It's not just telling the user a story, but informing the user by allowing him or her to be part of the story. That, I believe, is when data journalism becomes its most compelling and informative.
It also uses some advanced technical mapping skills that were new to me. I greatly enjoyed the thrill of learning how to structure a complex application, and add new tools to my toolkit. Now, I don't just have those new tools, but a better understanding of how to add other new tools.
Where do you turn to keep your skills updated or learn new things?
I look at other projects, both within the journalism industry and in general visualization communities. The Web inspector is my best friend. I'm always looking to see how people did things. I read blogs voraciously, and have a fairly robust Google Reader set of people whose work I follow closely. I also use lynda.com frequently (I tend to learn best by video tutorials). Hanging out on listservs for free tools I use (such as Leaflet), programming languages I care about (Python), or projects whose mission our work is related to (Sunlight Foundation) helps me engage with a community that cares about similar issues.
Help sites like Stack Overflow, and pretty much anything I can find on Google, are my other best friends. The not-so-secret secret of data journalism: we're learning as we go. That's part of what makes it so fun.
Really, the learning is not about paper or electronic resources. Like so much of journalism, this is best conquered, I argue, with persistence and stick-to-it-ness. I approach the process of data journalism and Web development as a beat. We attend key meetings. Instead of city council, it's NICAR. We develop vast rolodexes. I know people who have myriad specialties and feel comfortable calling on them. In return, I help people all over the world with this sort of work whenever I can, because it's that important. While we may work for competing places, we're really working toward the same goal: improving the way we inform the public about what's going on in our world. That knowledge matters a great deal.
Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?
More and more information is coming at us every day. The deluge is so vast that we need to not just say things are true, but prove those truths with verifiable facts. Data journalism allows for great specificity, and truths based in the scientific method. Using computers to commit data journalism allows us to process great amounts of information much more efficiently, and make the world more comprehensible to a user.
Also, while we are working with big data, often only a subset of that data is valuable to a specific user. Data journalism and Web deve[…]
Data
Gov_2.0
Publishing
datajournalism
dataproduct
datascience
datavisualization
nicarinterview
openmapping
from google
march 2012 by rahuldave
Profile of the Data Journalist: The Human Algorithm
march 2012 by rahuldave
Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.
To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.
Ben Welsh (@palewire) is a Web developer and journalist based in Los Angeles. Our interview follows.
Where do you work now? What is a day in your life like?
I work for the Los Angeles Times, a daily newspaper and 24-hour Web site based in Southern California. I'm a member of the Data Desk, a team of reporters and Web developers that specializes in maps, databases, analysis and visualization. We both build Web applications and conduct analysis for reporting projects.
I like to compare The Times to a factory, a factory that makes information. Metaphorically speaking, it has all sorts of different assembly lines. Just to list a few, one makes beautifully rendered narratives, another makes battleship-like investigative projects.
A typical day involves juggling work on different projects, mentally
moving from one assembly line to the other. Today I patched an embryonic open-source release, discussed our next move on a pending public records request, guided the real-time publication of results from the GOP primaries in Michigan and Arizona, and did some preparation for how we'll present a larger dump of results on Super Tuesday.
How did you get started in data journalism? Did you get any special
degrees or certificates?
I'm thrilled to see new-found interest in "data journalism" online. It's
drawing young, bright people into the field and involving people from
different domains. But it should be said that the idea isn't new.
I was initiated into the field as a graduate student at the Missouri School
of Journalism. There I worked at the National Institute for Computer-Assisted Reporting, also known as NICAR. Decades before anyone called it "data journalism," a disparate group of misfit reporters discovered that the data analysis made possible by computers enabled them to do more powerful investigative reporting. In 1989, they founded NICAR, which has, for decades, been teaching data skills
to journalists and nurturing a tribe of journalism geeks. In the time since, computerized data analysis has become a dominant force in investigative reporting, responsible for a large share of the field's best work.
To underscore my point, here's a 1986 Time magazine article about how
"newsmen are enlisting the machine."
Did you have any mentors? Who? What were the most important resources they
shared with you?
My first journalism job was in Chicago. I got a gig working for two great people there, Carol Marin and Don Moseley, who have spent most of their careers as television journalists. I worked as their assistant. Carol and Don are warm people who are good teachers, but they are also excellent at what they do. There was a moment when I realized, "Hey, I can do this!" It wasn't just something I heard about in class, but something I could actually see myself doing.
At Missouri, I had a great classmate named Brian
Hamman, who is now at the New York Times. I remember seeing how invested Brian was in the Web, totally committed to Web development as a career path. When an opportunity opened up to be a graduate assistant at NICAR, Brian encouraged me to pursue it. I learned enough SQL to help do farmed-out investigative work for TV stations. And, more importantly, I learned that if you had technical skills you could get the job to work on a cool story.
After that I got a job doing data analysis at the Center for Public Integrity in Washington DC. I had the opportunity to work on investigative projects, but also the chance to learn a lot of computer programming along the way. I had the guidance of my talented coworkers, Daniel Lathrop, Agustin Armendariz, John Perry, Richard Mullins and Helena Bengtsson. I learned that computer programming wasn't impossible. They taught me that if you have a manageable task, a few friends to help you out and a door you can close, you can figure out a lot.
What does your personal data journalism "stack" look like? What tools
could you not live without?
I do my daily development in the gedit text editor, Byobu's slick implementation of the screen terminal and the Chromium browser. And, this part may be hard to believe, but I love Ubuntu
Unity. I don't understand what everybody is complaining about.
I do almost all of my data management in the Python Web development
framework Django and
PostgreSQL's database, even if
the work is an exploratory reporting project that will never be published. I find that the structure of the framework can be useful for organizing just about any data-driven project.
I use GitHub for both version control and
project management. Without it, I'd be lost.
What data journalism project are you the most proud of working on or
creating?
As we all know, there's a lot of data out there. And, as anyone who works
with it knows, most of it is crap. The projects I'm most proud of have
taken large, ugly data sets and refined them into something worth knowing:
a nut graf in an investigative story, or a
data-driven app that gives the reader some new
insight into the world around them. It's impossible to pick one. I like to
think the best is still, as they say in the newspaper business,
TK.
Where do you turn to keep your skills updated or learn new things?
Twitter is a great way to keep up with what is getting other programmers excited. I know a lot of people find social media overwhelming or distracting, but I feel plugged in and inspired by what I find there. I wouldn't want to live without it.
GitHub is another great source. I've learned so much just exploring other
people's code. It's invaluable.
Why are data journalism and "news apps" important, in the context of the
contemporary digital environment for information?
Computers offer us an opportunity to better master information, better
understand each other and better watchdog those who would govern us. At
last week's NICAR conference, in a talk called "Human-Assisted Reporting,"
I tried to describe some of the ways that simply thinking about the process
of journalism as an algorithm can point the way. In my opinion, we should aspire to write code that embodies the idealistic principles and investigative methods of the previous generation. There's all this data out there now, and journalistic algorithms, "robot
reporters," can help us ask it tougher questions.
Data
Gov_2.0
Publishing
dataconference
datajournalism
dataproduct
datascience
nicarinterview
opensource
programming
from google
march 2012 by rahuldave
In the age of big data, data journalism has profound importance for society
march 2012 by rahuldave
The promise of data journalism was a strong theme throughout the National Institute for Computer-Assisted Reporting's (NICAR) 2012 conference. In 2012, making sense of big data through narrative and context, particularly unstructured data, will be a central goal for data scientists around the world, whether they work in newsrooms, on Wall Street or in Silicon Valley. Notably, that goal will be substantially enabled by a growing set of common tools, whether they're employed by government technologists opening Chicago, healthcare technologists or newsroom developers.
At NICAR 2012, you could literally see the code underpinning the future of journalism written - or at least projected - on the walls.
"The energy level was incredible," said David Herzog, associate professor for
print and digital news at the Missouri School of Journalism, in an email interview after NICAR. "I didn't see participants wringing their hands and worrying about the future of journalism. They're too busy building it."
Just as open civic software is increasingly baked into government, open source is playing a pivotal role in the new data journalism.
"Free and open-source tools dominated," said Herzog. "It's clear from the panels and hands-on classes that free and open source tools have eliminated the barrier to entry in terms of many software costs."
While many developers are agnostic with respect to which tools they use to get a job done, the people who are building and sharing tools for data journalism are often doing it with open source code. As Dan Sinker, the head of the Knight-Mozilla News Technology Partnership for Mozilla, wrote afterwards, journo-coders took NICAR 12 "to a whole new level."
The PANDA Project officially launched in beta in St. Louis, including a provisioning party. You can check out PANDA on GitHub.
The Associated Press' Overview Project is going to make digging through documents easier.
The LA Times Datadesk shared a tool, Django Bakery, to turn Django applications into flat HTML files.
Jonathan Soma, with help from John Keefe of WNYC, built Tabletop.js as a way to use a public Google Spreadsheet as a source of data for a Web app. Tabletop is on GitHub too.
While some of that open source development was definitely driven by the requirements of the Knight News Challenge, which funded the PANDA and Overview projects, there's also a collaborative spirit in evidence throughout this community.
This is a group of people who are fiercely committed to "showing your work" -- and for newsroom developers, that means sharing your code. To put it another way: code, don't tell. Sessions on Python, Django, mapping, Google Refine and Google Fusion Tables were packed at NICAR 12.
Journalists want to learn python: #nicar12 twitter.com/MeghanHoyer/st
— Meghan Hoyer (@MeghanHoyer) February 24, 2012
No, this is not your father's computer-assisted reporting.
"I thought this stacked up as the best NICAR conference since the first in 1993," said Herzog. "It's always been tough to choose from the menu of panels, demos and hands-on classes at NICAR conferences. But I thought there was an abundance of great, informative sessions put on by the participants. Also, I think NICAR offered a good range of options for newbies and experts alike. For instance, attendees could learn how to map using Google Fusion Tables on the beginner's end, or PostGIS and QGIS at the advanced level. Harvesting data through web scraping has become an ever bigger deal for data journalists. At the same time, it's getting easier for folks with little or no programming chops to scrape using tools like spreadsheets, Google Refine and ScraperWiki."
We've scraped a webpage using a simple chrome extension with @michelleminkoff at #nicar12 twitter.com/ChrisLKeller/s
— Chris Keller (@ChrisLKeller) February 25, 2012
On the history of NICAR
According to IRE, NICAR was founded in 1989. Since its founding, the Institute has trained thousands of journalists how to find, collect and publish electronic information.
Today, "the NICAR conference helps journalists, hackers, and developers figure out best practices, best methods, and best digital tools for doing journalism that involves data analysis and classic reporting in the field," said Brant Houston, former executive director of Investigative Reporters and Editors, in an email interview. "The NICAR conference also obviously includes investigative journalism and the standards for data integrity and credibility."
"I believe the first IRE-sponsored [conference] was in 1993 in Raleigh, when a few reporters were trying to acquire and learn to use spreadsheets, database
managers, etc. on newly open electronic records," said Sarah Cohen, the Knight professor of the practice of journalism and public policy at Duke University, in an email interview. "Elliott Jaspin was going around the country teaching reporters how to get data off of 9-track tapes. There really was no public Internet. At the time, it was really, really hard to use the new PCs, and a few reporters were trying to find new stories. The famous ones had been Elliott's school bus drivers who had drunk driving records and the Atlanta Color of Money series on redlining."
"St. Louis was my 10th NICAR conference," said Anthony DeBarros, the senior database editor at USA Today, in an email interview. "My first was in 1999 in Boston. The conference is a place where news nerds can gather and remind themselves that they're not alone in their love of numbers, data analysis, writing code and finding great stories by poring over columns in a spreadsheet. It serves as an important training vehicle for journalists getting started with data in the newsroom, and it's always kept journalists apprised of technological developments that offer new ways of finding and telling stories. At the same time, its connection to IRE keeps it firmly rooted in the best aspects of investigative reporting -- digging up stories that serve the public good."
Baby, you can drive my CAR
Long before we started talking about "data journalism," the practice of computer-assisted reporting (CAR) was growing around the world.
"The practice of CAR has changed over time as the tools and environment in the digital world have changed," said Houston. "So it began in the time of mainframes in the late '60s and then moved onto PCs (which increased speed and flexibility of analysis and presentation) and then moved onto the Web, which accelerated the ability to gather, analyze and present data. The basic goals have remained the same: to sift through data and make sense of it, often with social science methods. CAR tends to be an 'umbrella' term - one that includes precision journalism and data-driven journalism and any methodology that makes sense of data, such as visualization and effective presentations of data."
On one level, CAR is still around because the journalism world hasn't coined a good term to use instead.
"'Computer-assisted reporting' is an antiquated term, but most people who practice it have recognized that for years," said DeBarros. "It sticks around because no one has yet come up with a dynamite replacement. Phil Meyer, the godfather of the movement, wrote a seminal book called 'Precision Journalism,' and that term is a good one to describe that segment of CAR that deals with statistics and the use of social science methods in newsgathering. As an umbrella term, data journalism seems to be the best description at the moment, probably because it adequately covers most of the areas that CAR has become -- from traditional data-driven reporting to the newer category of news applications."
The most significant shift in CAR may well be when all of those computers being used for reporting were connected through the network of networks in the 1990s.
"It may seem obvious, but of course the Internet changed it all, and for a while it got smushed in with trying to learn how to navigate the Internet for stories, and how to download data," said Cohen. "Then there was a stage when everyone was building internal intranets to deliver public records inside newsrooms to help find people on deadline, etc. So for much of the time, it was focused on reporting, not publishing or presentation. Now the data journalism folks have emerged from the other direction: People who are using data obtained through APIs who often skip the reporting side, and use the same techniques to deliver unfiltered information to their readers in an easier format than the government is giving us. But I think it's starting to come back together -- the so-called data journalists are getting more interested in reporting, and the more traditional CAR reporters are interested in getting their stories on the web in more interesting ways."
Whatever you call it, the goals are still the same.
"CAR has always been about using data to find and tell stories," said DeBarros. "And it still is. What has changed in recent years is more emphasis toward online presentations (interactive maps and applications) and the coding skills required to produce them (JavaScript, HTML/CSS, Django, Ruby on Rails). Earlier NICAR conferences revolved much more around the best stories of the year and how to use data techniques to cover particular topics and beats. That's still in place. But more recently, the conference and the practice have widened to include many more coding and presentation topics. That reflects the state of media -- every newsroom is working overtime to make its content work well on the web, on mobile, and on apps, and data journalists tend to be forward thinkers so it's not surprising that the conference would expand to include those topics."
What stood out at NICAR 2012?
The tools and tactics on display at NICAR were enough to convince Tyler Dukes at Duke to write that "NICAR taught me I know nothing." Browse through the tools, slides and links from NICAR 2012 curated by Chrys Wu to get a sense of just h[…]
Data
Gov_2.0
Publishing
datajournalism
dataproduct
datascience
nicarinterview
opengovernment
from google
march 2012 by rahuldave
Strata Week: DataSift lets you mine two years of Twitter data
march 2012 by rahuldave
Here are a few of the data stories that caught my attention this week.
Twitter's historical archives, via DataSift
DataSift, one of the two companies that have official access to the Twitter firehose (the other being Gnip), announced its new Historics service this week, giving customers access to up to two years' worth of historical Tweets. (By comparison, Gnip offers 30 days of Twitter data, and other developers and users have access to roughly a week's worth of Tweets.)
GigaOm's Barb Darrow responded to those who might be skeptical about the relevance of this sort of historic Twitter data in a service that emphasizes real-time. Darrow noted that DataSift CEO Rob Bailey said companies planning new products, promotions or price changes would do well to study the impact of their past actions before proceeding and that Twitter is the perfect venue for that.
Another indication of the desirability of this new Twitter data: the waiting list for Historics already includes a number of Fortune 500 companies. The service will get its official launch in April.
Building a school of data
Although there are plenty of ways to receive formal training in math, statistics and engineering, there aren't a lot of options when it comes to an education specifically in data science.
To that end, the Open Knowledge Foundation and Peer to Peer University (P2PU) have proposed a School of Data, arguing that:
"It will be years before data specialist degree paths become broadly available and accepted, and even then, time-intensive degree courses may not be the right option for journalists, activists, or computer programmers who just need to add data skills to their existing expertise. What is needed are flexible, on-demand, shorter learning options for people who are actively working in areas that benefit from data skills, particularly those who may have already left formal education programmes."
The organizations are seeking volunteers to help develop the project, whether that's in the form of educational materials, learning challenges, mentorship, or a potential student body.
Strata in California
The Strata Conference wraps up today in Santa Clara, Calif. If you missed Strata this year and weren't able to catch the livestream of the conference, look for excerpts and videos posted here on Radar and through the O'Reilly YouTube channel in the coming weeks.
And be sure to make plans for Strata New York, being held October 23-25. That event will mark the merger with Hadoop World. The call for speaker proposals for Strata NY is now open.
Got data news?
Feel free to email me.
Related:
Building data science teams
Big bucks for DataSift and for data from Twitter's firehose
The challenges of streaming real-time data
More Strata Week coverage
Data
datascience
datascientists
datasift
schoolofdata
strataweek
twitter
twitterarchive
from google
march 2012 by rahuldave
Everyone has a big data problem
february 2012 by rahuldave
Jonathan Gosier (@jongos), designer, developer and the co-founder of metaLayer.com, says the big data deluge presents problems for everyone, not just corporations and governments.
Gosier will be speaking at next week's Strata conference on "The Democratization of Data Platforms." In the following interview, he discusses the challenges and opportunities data democratization creates.
Your keynote is going to be about "everyone's" big data problems. If everyone really does have their own big data problem, how are we going to democratize big data tools and processes? It seems that our various problems would require many different solutions.
Jonathan Gosier: It's a problem for everyone because data problems can manifest in a multitude of ways: too much email, too many passwords to remember, a deluge of legal documents related to a mortgage, or simply knowing where to look online for the answers to simple questions.
You're absolutely correct in noting that each of these problems requires different solutions. However, many of these solutions tend not to be accessible to the average person, whether this is because of prices or a level of expertise required to use the tools available.
There is a lot of talk about a "digital divide," but there's a growing "data divide" as well. It's no longer about having basic computer literacy skills. Being able to understand what data is available, how it can be manipulated, and how it can be used to actually improve one's life is a skill that not everyone possesses.
There's an opportunity here for growth as well. If you look at the market, there are tools for visualizing personal finance (think Mint.com or HelloWallet), personal health (23andMe), personal productivity (Basecamp), etc. But the overarching trend is that there is a growing need for products that simplify the wealth of information around people. The simplest way to do this is often through visuals.
Why are visualizations so important to a better understanding of data?
Jonathan Gosier: Visualizations are only "better" in that they can relate complex ideas to a general audience. Visualization is by no means a replacement for expertise and research. It simply represents a method for communicating across barriers of knowledge.
But beyond that, the problem with a lot of the data visuals on the web is that they are static, pre-constructed, and vague about their data sources. This means the general public either has to take what's presented on face value and agree or disagree, or they have to conduct their own research.
There's a need for "living infographics" — visualizations that are inviting and easy to understand, but are shared with the underlying data used to create them. This allows the casual consumer to simply admire the visual while the more discerning audience can actually analyze the underlying data to see if the message being presented is consistent with their findings.
It's far more transparent and credible to reveal, versus conceal, one's sources.
One of the pushbacks to data democratization efforts is that people might not know how to use these tools correctly and/or they might use them to further their own agendas. How do you respond to that?
Jonathan Gosier: The question illustrates the point, actually. It wasn't so long ago that the same could be said about the printing press. It was an innovation, but initially, it was so expensive that it was a technology that was only available to the elite and wealthy. Now it's common (at least in the Western world) for any given middle-class household to contain an inexpensive printing device. The web radicalized things even more, essentially turning anyone with access into a publisher.
So the question becomes, was it good or bad that publishing became something that anyone could do versus a select few? I'd argue that, ultimately, the pros have outweighed the cons by magnitudes.
Right now data can be thought of as an asset of the elite and privileged. Those with wealth pay a lot for it, and those who are highly skilled can charge a great deal for their services around it. But the reality is, there is a huge portion of the market that has a legitimate need for data solutions that aren't currently available to them.
Related:
Data for the public good
Don't blame the information for your bad habits
Data science democratized
Data
bigdata
datademocratization
datatools
from google
february 2012 by rahuldave
Big data in the cloud
february 2012 by rahuldave
Sections
IaaS and private clouds
Platform solutions
Big data cloud platforms compared
Conclusion
Big data and cloud technology go hand-in-hand. Big data needs clusters
of servers for processing, which clouds can readily provide. So goes
the marketing message, but what does that look like in reality? Both
"cloud" and "big data" have broad definitions, obscured by considerable
hype. This article breaks down the landscape as simply as possible,
highlighting what's practical, and what's to come.
IaaS and private clouds
What is often called "cloud" amounts to virtualized servers: computing
resources that present themselves as regular servers, rentable by
consumption. This is generally called infrastructure as a service
(IaaS), and is offered by platforms such as Rackspace Cloud or Amazon
EC2. You buy time on these services, and install and configure your
own software, such as a Hadoop cluster or NoSQL database. Most of the
solutions I described in my Big Data Market Survey can be deployed on
IaaS services.
Using IaaS clouds doesn't mean you must handle all deployment
manually: good news for the clusters of machines big data
requires. You can use orchestration frameworks, which handle the
management of resources, and automated infrastructure tools, which
handle server installation and configuration. RightScale offers a
commercial multi-cloud management platform that mitigates some of the
problems of managing servers in the cloud.
Frameworks such as OpenStack and Eucalyptus aim to present a uniform
interface to both private data centers and the public
cloud. Attracting a strong flow of cross-industry support, OpenStack
currently addresses computing resources (akin to Amazon's EC2) and
storage (paralleling Amazon S3).
The race is on to make private clouds and IaaS services more usable:
over the next two years, using clouds should become much more
straightforward as vendors adopt the nascent standards. There'll be a
uniform interface, whether you're using public or private cloud
facilities, or a hybrid of the two.
Particular to big data, several configuration tools already target
Hadoop explicitly: among them Dell's Crowbar, which aims to make
deploying and configuring clusters simple, and Apache Whirr, which is
specialized for running Hadoop services and other clustered data processing systems.
Today, using IaaS gives you a broad choice of cloud supplier, the
option of using a private cloud, and complete control: but you'll be
responsible for deploying, managing and maintaining your clusters.
Platform solutions
Using IaaS only gets you so far with big data applications: these
services handle the creation of computing and storage resources, but
don't address anything at a higher level. The setup of Hadoop and Hive
or a similar solution is down to you.
Beyond IaaS, several cloud services provide application layer support
for big data work. Sometimes referred to as managed solutions, or
platform as a service (PaaS), these services remove the need to
configure or scale things such as databases or MapReduce, reducing
your workload and maintenance burden. Additionally, PaaS providers can
realize great efficiencies by hosting at the application level, and
pass those savings on to the customer.
The general PaaS market is burgeoning, with major players including
VMware (Cloud Foundry) and Salesforce (Heroku, force.com). As big data
and machine learning requirements percolate through the industry,
these players are likely to add their own big-data-specific
services. For the purposes of this article, though, I will be sticking
to the vendors who already have implemented big data solutions.
Today's primary providers of such big data platform services are
Amazon, Google and Microsoft. You can see their offerings summarized
in the table toward the end of this article. Both Amazon Web Services
and Microsoft's Azure blur the lines between infrastructure as a
service and platform: you can mix and match. By contrast, Google's
philosophy is to skip the notion of a server altogether, and focus
only on the concept of the application. Among these, only Amazon can
lay claim to extensive experience with their product.
Amazon Web Services
Amazon has significant experience in hosting big data processing.
Use of Amazon EC2 for Hadoop was a popular and natural move for many
early adopters of big data, thanks to Amazon's expandable supply of
compute power. Building on this, Amazon launched Elastic Map Reduce
in 2009, providing a hosted, scalable Hadoop service.
Applications on Amazon's platform can pick from the best of both the
IaaS and PaaS worlds. General purpose EC2 servers host applications
that can then access the appropriate special purpose managed solutions
provided by Amazon.
As well as Elastic Map Reduce, Amazon offers several other services
relevant to big data, such as the Simple Queue Service for
coordinating distributed computing, and a hosted relational database
service. At the specialist end of big data, Amazon's High Performance
Computing solutions are tuned for low-latency cluster computing, of
the sort required by scientific and engineering applications.
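To make the idea of a queue coordinating distributed work concrete, here is a minimal local sketch. It is an assumption-laden stand-in: it uses Python's in-process queue and threads rather than the Simple Queue Service itself, and the function name run_workers and the squaring "work" are hypothetical. The pattern is the same in spirit, though: producers put tasks on a shared queue, and any free worker takes the next one.

```python
import queue
import threading
from typing import List


def run_workers(tasks: List[int], worker_count: int = 3) -> List[int]:
    """Distribute tasks to workers through a shared queue and collect results.

    A local stand-in for a hosted queue: this does not call any Amazon API.
    """
    task_queue: "queue.Queue[int]" = queue.Queue()
    results: List[int] = []
    results_lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                item = task_queue.get_nowait()
            except queue.Empty:
                return  # no work left, so this worker exits
            with results_lock:
                results.append(item * item)  # stand-in for real work
            task_queue.task_done()

    # All tasks are queued before any worker starts, so an empty queue
    # reliably means the job is finished.
    for task in tasks:
        task_queue.put(task)
    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_workers([1, 2, 3, 4]))
```

Workers never talk to each other directly; the queue is the only point of coordination, which is what makes it easy to add or remove workers.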
Elastic Map Reduce
Elastic Map Reduce (EMR) can be programmed in the usual Hadoop ways,
through Pig, Hive or another programming language, and uses Amazon's S3
storage service to get data in and out.
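For readers who have not written a Hadoop job, the following is a rough sketch of the map, shuffle and reduce steps that such a job is built from, run in a single local process. It is not EMR code and calls no Amazon or Hadoop API; the function names are hypothetical, and on a real cluster the framework performs the shuffle step and the input would come from storage such as S3.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in a single local process."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data in the cloud", "the cloud and big data"]
    print(word_count(sample))
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be spread across many machines, which is the property a hosted Hadoop service exploits.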
Access to Elastic Map Reduce is through Amazon's SDKs and tools, or with
GUI analytical and IDE products such as those offered by Karmasphere.
In conjunction with these tools, EMR represents a strong option for
experimental and analytical work. Amazon's EMR pricing makes using EMR
a much more attractive option than configuring EC2 instances yourself
to run Hadoop.
When integrating Hadoop with applications generating structured data,
using S3 as the main data source can be unwieldy. This is because,
similar to Hadoop's HDFS, S3 works at the level of storing blobs
of opaque data. Hadoop's answer to this is HBase, a NoSQL database
that integrates with the rest of the Hadoop stack. Unfortunately,
Amazon does not currently offer HBase with Elastic Map Reduce.
DynamoDB
Instead of HBase, Amazon provides DynamoDB, its own managed, scalable
NoSQL database. As this is a managed solution, it represents a better
choice than running your own database on top of EC2, in terms of both
performance and economy.
DynamoDB data can be exported to and imported from S3, providing
interoperability with EMR.
Google
Google's cloud platform stands out as distinct from its
competitors. Rather than offering virtualization, it provides an
application container with defined APIs and services. Developers do
not need to concern themselves with the concept of machines:
applications execute in the cloud, getting access to as much
processing power as they need, within defined resource usage limits.
To use Google's platform, you must work within the constraints of
its APIs. However, if that fits, you can reap the benefits of the
security, tuning and performance improvements inherent to the way
Google develops all its services.
AppEngine, Google's cloud application hosting service, offers a
MapReduce facility for parallel computation over data, but this is
more of a feature for use as part of complex applications rather than
for analytical purposes. Instead, BigQuery and the Prediction API form
the core of Google's big data offering, respectively offering analysis
and machine learning facilities. Both these services are available
exclusively via REST APIs, consistent with Google's vision for
web-based computing.
BigQuery
BigQuery is an analytical database, suitable for interactive analysis
over datasets of the order of 1TB. It works best on a small number of
tables with a large number of rows. BigQuery offers a familiar SQL
interface to its data. In that, it is comparable to Apache Hive, but
the typical performance is faster, making BigQuery a good choice for
exploratory data analysis.
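The kind of exploratory question this suits is an aggregate over one long, narrow table. The sketch below illustrates that shape of query only: it uses Python's built-in sqlite3 module as a local stand-in, not BigQuery, and the page_views table, its columns and the top_pages helper are all hypothetical.

```python
import sqlite3
from typing import List, Tuple


def top_pages(rows: List[Tuple[str, int]], limit: int = 2) -> List[Tuple[str, int]]:
    """Load (page, views) rows into one table and rank pages by total views.

    SQLite is used here purely for illustration; no BigQuery API is involved.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT page, SUM(views) AS total FROM page_views "
            "GROUP BY page ORDER BY total DESC, page ASC LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("/home", 10), ("/about", 3), ("/home", 5), ("/blog", 7), ("/about", 1)]
    print(top_pages(sample))
```

The query touches every row but only two columns, which is the access pattern an analytical database is designed around.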
Getting data into BigQuery is a matter of directly uploading it, or
importing it from Google's Cloud Storage system. This is the aspect of
BigQuery with the biggest room for improvement. Whereas Amazon's S3
lets you mail in disks for import, Google doesn't currently have this
facility. Streaming data into BigQuery isn't viable either, so regular
imports are required for constantly updating data. Finally, as
BigQuery only accepts data formatted as comma-separated value (CSV)
files, you will need to use external methods to clean up the data
beforehand.
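As one possible shape for that clean-up step, here is a small sketch that turns loosely structured records into consistent CSV using Python's standard csv module. The function name to_clean_csv, the column names and the choice to emit a header row are my assumptions for illustration; check the import requirements of the service you are loading into before relying on any of them.

```python
import csv
import io
from typing import Dict, Iterable, List


def to_clean_csv(records: Iterable[Dict[str, object]], columns: List[str]) -> str:
    """Write records as CSV with a fixed column order.

    Missing or None fields become empty strings, and surrounding
    whitespace is stripped from string values. The csv module quotes
    embedded commas, quotes and newlines.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)  # header row: an assumption, drop if unwanted
    for record in records:
        row = []
        for column in columns:
            value = record.get(column, "")
            if value is None:
                value = ""
            if isinstance(value, str):
                value = value.strip()
            row.append(value)
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {"user": " alice ", "text": "hello, world", "retweets": 3},
        {"user": "bob", "text": None},
    ]
    print(to_clean_csv(sample, ["user", "text", "retweets"]))
```

Fixing the column order up front means every row has the same number of fields, which is the most common thing a CSV importer rejects.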
Rather than provide end-user interfaces itself, Google wants an
ecosystem to grow around BigQuery, with vendors incorporating it into
their products, in the same way Elastic Map Reduce has acquired tool
integration. Currently in beta test, to which anybody can apply,
BigQuery is expected to be publicly available during 2012.
Prediction API
Many uses of machine learning are well defined, such as
classification, sentiment analysis, or recommendation generation. To
meet these needs, Google offers its Prediction API product.
Applications using the Prediction API work by creating and training a
model hosted within Google's system. Once trained, this model can be
used to make predictions, such as spam detection. Google is working
on allowing these models to be shared, optionally with a fee. This
will let you take advantage of previously trained models, which in
many cases will save you time and expertise with training.
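The train-then-predict workflow can be sketched locally. The example below is not the Prediction API and makes no claim about how Google's hosted models work; it is a tiny naive Bayes text classifier, with a hypothetical class name and made-up training data, that only shows the two phases an application goes through.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


class TinyTextClassifier:
    """Multinomial naive Bayes with add-one smoothing, for illustration only."""

    def __init__(self) -> None:
        self.label_counts: Counter = Counter()
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocabulary: Set[str] = set()

    def train(self, examples: Iterable[Tuple[str, str]]) -> None:
        """Learn word frequencies from (label, text) pairs."""
        for label, text in examples:
            self.label_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text: str) -> str:
        """Return the label with the highest log-probability for the text."""
        total = sum(self.label_counts.values())
        best_label, best_score = "", -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            words_in_label = sum(self.word_counts[label].values())
            for word in text.lower().split():
                seen = self.word_counts[label][word]
                score += math.log((seen + 1) / (words_in_label + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = TinyTextClassifier()
    model.train([
        ("spam", "win cash now"),
        ("spam", "cheap cash prize"),
        ("ham", "meeting at noon"),
        ("ham", "lunch at noon tomorrow"),
    ])
    print(model.predict("win a cash prize"))
    print(model.predict("lunch meeting at noon"))
```

With a hosted service, the training call and the stored model live on the provider's side; the application keeps only the examples it sends and the predictions it gets back.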
Though promising, Google's offerings are in their early days. Further
integration b[…]
Data
Web_Ops_&_Performance
bigdataanalytics
cloudproviders
planningforbigdata
strataconference
from google
BigQuery
BigQuery is an analytical database, suitable for interactive analysis
over datasets of the order of 1TB. It works best on a small number of
tables with a large number of rows. BigQuery offers a familiar SQL
interface to its data. In that, it is comparable to Apache Hive, but
the typical performance is faster, making BigQuery a good choice for
exploratory data analysis.
Getting data into BigQuery is a matter of directly uploading it, or
importing it from Google's Cloud Storage system. This is the aspect of
BigQuery with the biggest room for improvement. Whereas Amazon's S3
lets you mail in disks for import, Google doesn't currently have this
facility. Streaming data into BigQuery isn't viable either, so regular
imports are required for constantly updating data. Finally, as
BigQuery only accepts data formatted as comma-separated value (CSV)
files, you will need to use external methods to clean up the data
beforehand.
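As an illustration of the kind of external cleanup this implies, the sketch below uses Python's standard csv module to trim whitespace and drop blank or malformed rows before upload. The function name clean_rows and the rule of discarding rows with the wrong column count are assumptions made for the example, not requirements stated by Google.

```python
import csv
import io


def clean_rows(raw_text: str, expected_columns: int) -> str:
    """Return CSV text with trimmed fields and malformed rows removed."""
    reader = csv.reader(io.StringIO(raw_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        row = [field.strip() for field in row]
        # Skip blank lines and rows with the wrong number of columns.
        if len(row) != expected_columns or not any(row):
            continue
        # The writer re-quotes fields that contain commas or quotes.
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    messy = 'id,name\n1, Alice \n\n2,Bob,extra\n3,"Smith, J"\n'
    print(clean_rows(messy, expected_columns=2))
```

Silently dropping bad rows is the simplest policy; a production pipeline would more likely log or quarantine them so that nothing disappears unnoticed.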
Rather than provide end-user interfaces itself, Google wants an
ecosystem to grow around BigQuery, with vendors incorporating it into
their products, in the same way Elastic Map Reduce has acquired tool
integration. Currently in beta test, to which anybody can apply,
BigQuery is expected to be publicly available during 2012.
Prediction API
Many uses of machine learning are well defined, such as
classification, sentiment analysis, or recommendation generation. To
meet these needs, Google offers its Prediction API product.
Applications using the Prediction API work by creating and training a
model hosted within Google's system. Once trained, this model can be
used to make predictions, such as spam detection. Google is working
on allowing these models to be shared, optionally with a fee. This
will let you take advantage of previously trained models, which in
many cases will save you time and expertise with training.
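The train-then-predict workflow can be sketched locally with a very small text classifier. This is not Google's API: it is a plain-Python naive Bayes model with add-one smoothing, and the class name TinyTextClassifier, its methods, and the training sentences are all invented for the example. A hosted service keeps the trained model on its side and answers the same two kinds of request.

```python
import math
from collections import Counter


class TinyTextClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {}          # label -> Counter of words seen
        self.label_counts = Counter()  # label -> number of training texts
        self.vocabulary = set()

    def train(self, examples):
        """Learn word frequencies from (text, label) pairs."""
        for text, label in examples:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocabulary.update(words)

    def predict(self, text):
        """Return the label with the highest log-probability score."""
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(doc_count / total_docs)
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = TinyTextClassifier()
    model.train([
        ("win free money now", "spam"),
        ("free prize claim now", "spam"),
        ("meeting agenda for monday", "ham"),
        ("lunch on monday", "ham"),
    ])
    print(model.predict("free money"))
```

Sharing a trained model, as Google proposes, would amount to letting other applications call predict without ever running train themselves.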
Though promising, Google's offerings are in their early days. Further
integration b[…]
february 2012 by rahuldave
Report from HIMSS: health care tries to leap the chasm from the average to the superb
february 2012 by rahuldave
I couldn't attend the session today on StealthVest--and small surprise. Who wouldn't want to come see an Arduino-based garment that can hold numerous health-monitoring devices in a way that is supposed to feel like a completely normal piece of clothing? As with many events at the HIMSS conference, which has registered over 35,000 people (at least four thousand more than last year), the StealthVest presentation drew an overflow crowd.
StealthVest sounds incredibly cool (and I may have another chance to report on it Thursday), but when I gave up on getting into the talk I walked downstairs to a session that sounds kind of boring but may actually be more significant: Practical Application of Control Theory to Improve Capacity in a Clinical Setting.
The speakers on this session, from Banner Gateway Medical Center in Gilbert, Arizona, laid out a fairly standard use of analytics to predict when the hospital units are likely to exceed their capacity, and then to reschedule patients and provider schedules to smooth out the curve. The basic idea comes from chemical engineering, and requires them to monitor all the factors that lead patients to come in to the hospital and that determine how long they stay. Queuing theory can show when things are likely to get tight. Hospitals care a lot about these workflow issues, as Fred Trotter and David Uhlman discuss in the O'Reilly book Beyond Meaningful Use, and they have a real effect on patient care too.
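To give a feel for the queuing arithmetic, here is a rough sketch using two textbook results: Little's law for the average number of occupied beds, and the Erlang C formula for the chance that a new arrival finds every bed taken. It is not the Banner Gateway model, which the talk did not spell out. The function names and all figures are hypothetical, and Erlang C assumes random (Poisson) arrivals and exponentially distributed stays, which real hospitals only approximate.

```python
import math


def average_census(arrivals_per_day: float, mean_stay_days: float) -> float:
    """Little's law: mean occupied beds = arrival rate x mean length of stay."""
    return arrivals_per_day * mean_stay_days


def probability_of_waiting(beds: int, offered_load: float) -> float:
    """Erlang C: probability that an arrival finds all beds occupied.

    offered_load is the average census from Little's law. If it meets or
    exceeds the number of beds, the unit can never catch up, so return 1.0.
    """
    if offered_load >= beds:
        return 1.0
    busy = (offered_load ** beds / math.factorial(beds)) * (beds / (beds - offered_load))
    not_busy = sum(offered_load ** k / math.factorial(k) for k in range(beds))
    return busy / (not_busy + busy)


if __name__ == "__main__":
    load = average_census(arrivals_per_day=12, mean_stay_days=4.5)
    for beds in (56, 60, 70):
        print(beds, "beds:", round(probability_of_waiting(beds, load), 3))
```

Comparing the results for different bed counts, or for a smoothed arrival rate, is the sort of what-if question that rescheduling patients and staff is meant to answer.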
The reason I find this topic interesting is that capacity planning leads fairly quickly to visible cost savings. So hospitals are likely to do it. Furthermore, once they go down the path of collecting long-term data and crunching it, they may extend the practice to clinical decision support, public health reporting, and other things that can make a big difference to patient care.
A few stats about data in U.S. health care
Do we need a big push to do such things? We sure do, and that's why meaningful use was introduced into HITECH sections of the American Recovery and Reinvestment Act. HHS released mounds of government health data on Health.data.gov hoping to serve a similar purpose. Let's just take a look at how far the United States is from using its health data effectively.
Last November, a CompTIA survey (reported by Health Care IT News) found that only 28% of providers have comprehensive EHRs in use, and another 17% have partial implementations. One has to remember that even a "comprehensive" EHR is unlikely to support the sophisticated data mining, information exchange, and process improvement that will eventually lead to lower costs and better care.
According to a recent Beacon Partners survey (PDF), half of the responding institutions have not yet set up an infrastructure for pursuing health information exchange, although 70% consider it a priority. The main problem, according to a HIMSS survey, is budget: HIEs are shockingly expensive. There's more to this story, which I reported on from a recent conference in Massachusetts.
Stats like these have to be considered when HIMSS board chair, Charlene S. Underwood, extolled the organization's achievements in the morning keynote. HIMSS has promoted good causes, but only recently has it addressed cost, interoperability, and open source issues that can allow health IT to break out of the elite of institutions large or sophisticated enough to adopt the right practices.
As signs of change, I am particularly happy to hear of HIMSS's new collaboration with Open Health Tools and their acquisition of the mHealth summit. These should guide the health care field toward more patient engagement and adaptable computer systems. HIEs are another area crying out for change.
An HIE optimist
With the flaccid figures for HIE adoption in mind, I met Charles Parisot, chair of Interoperability Standards and Testing Manager for EHRA, which is HIMSS's Electronic Health Records Association. The biggest EHR vendors and HIEs come together in this association, and Parisot was just stoked with positive stories about their advances.
His take on the cost of HIEs is that most of them just do it in a brute force manner that doesn't work. They actually copy the data from each institution into a central database, which is hard to manage from many standpoints. The HIEs that have done it right (notably in New York state and parts of Tennessee) are sleek and low-cost. The solution involves:
Keeping the data at the health care providers, and storing in the HIE only some glue data that associates the patient and the type of data to the provider.
Keeping all metadata about formats out of the HIE, so that new formats, new codes, and new types of data can easily be introduced into the system without recoding the HIE.
Breaking information exchange down into constituent parts--the data itself, the exchange protocols, identification, standards for encryption and integrity, etc.--and finding standard solutions for each of these.
So EHRA has developed profiles (also known by their ONC term, implementation specifications) that indicate which standard is used for each part of the data exchange. Metadata can be stored in the core HL7 document, the Clinical Document Architecture, and differences between implementations of HL7 documents by different vendors can also be documented.
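The federated design Parisot describes, where the HIE holds only "glue data" and the records stay with the providers, can be sketched roughly as follows. Everything here is hypothetical: the RecordLocator class, the Pointer fields, and the idea of modelling each provider as a simple callable are assumptions for illustration and do not come from any EHRA profile.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    """Glue data only: who holds what kind of data, never the data itself."""
    provider_id: str
    data_type: str


class RecordLocator:
    """Central HIE index mapping a patient to the providers holding records."""

    def __init__(self):
        self._index = defaultdict(set)

    def register(self, patient_id, provider_id, data_type):
        self._index[patient_id].add(Pointer(provider_id, data_type))

    def locate(self, patient_id, data_type=None):
        pointers = self._index.get(patient_id, set())
        return sorted(
            (p for p in pointers if data_type is None or p.data_type == data_type),
            key=lambda p: (p.provider_id, p.data_type),
        )


def fetch(locator, providers, patient_id, data_type):
    """Pull matching documents from each provider that the index points to.

    providers maps a provider id to a callable standing in for that
    provider's own system; the HIE never stores what the callable returns.
    """
    return [
        providers[p.provider_id](patient_id, data_type)
        for p in locator.locate(patient_id, data_type)
    ]
```

Because the index knows nothing about document formats, a provider can introduce a new format without any change to the central service, which is the property the second point in the list is after.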
A view of different architectures in their approach can be found in an EHRA white paper, Supporting a Robust Health Information Exchange Strategy with a Pragmatic Transport Framework. As testament to their success, Parisot claimed that the interoperability lab (a huge part of the exhibit hall floor space, and a popular destination for attendees) could set up the software connecting all the vendors' and HIEs' systems in one hour.
I asked him about the simple email solution promised by the government's Direct project, and whether that may be the path forward for small, cash-strapped providers. He accepted that Direct is part of the solution, but warned that it doesn't make things so simple. Unless two providers have a pre-existing relationship, they need to be part of a directory or even a set of federated directories, and assure their identities through digital signatures.
And what if a large hospital receives hundreds of email messages a day from various doctors who don't even know to whom their patients are being referred? Parisot says metadata must accompany any communications--and he's found that it's more effective for institutions to pull the data they want than for referring physicians to push it.
Intelligence for hospitals
Finally, Parisot told me EHRA has developed standards for submitting data to EHRs from 350 types of devices, and has 50 manufacturers working on devices with these standards. I visited a booth of iSirona as an example. They accept basic monitoring data such as pulses from different systems that use different formats, and translate over 50 items of information into a simple text format that they transmit to an EHR. They also add networking to devices that communicate only over cables. Outlying values can be rejected by a person monitoring the data. The vendor pointed out that format translation will be necessary for some time to come, because neither vendors nor hospitals will replace their devices simply to implement a new data transfer protocol.
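A rough sketch of this kind of format translation follows. The two vendor message formats, the pipe-delimited output line, and the plausibility bounds are all invented for illustration; they do not describe iSirona's product or any EHRA standard. Outlying readings are only flagged here, since the article says a person makes the decision to reject them.

```python
# Illustrative bounds only, not clinical guidance.
PLAUSIBLE_RANGE = {"pulse": (20.0, 250.0)}


def from_vendor_a(message: dict) -> dict:
    """Hypothetical vendor A sends dictionaries like {"hr": 72, "ts": "..."}."""
    return {"time": message["ts"], "metric": "pulse", "value": float(message["hr"])}


def from_vendor_b(message: str) -> dict:
    """Hypothetical vendor B sends strings like "PULSE;<timestamp>;71"."""
    name, timestamp, value = message.split(";")
    return {"time": timestamp, "metric": name.lower(), "value": float(value)}


def to_text_line(reading: dict) -> str:
    """Render any normalized reading in one simple text format."""
    return f'{reading["time"]}|{reading["metric"]}|{reading["value"]:g}'


def needs_review(reading: dict) -> bool:
    """Flag an outlying value so that a person can accept or reject it."""
    low, high = PLAUSIBLE_RANGE[reading["metric"]]
    return not (low <= reading["value"] <= high)


if __name__ == "__main__":
    readings = [
        from_vendor_a({"hr": 72, "ts": "2012-02-21T10:00:00"}),
        from_vendor_b("PULSE;2012-02-21T10:00:05;400"),
    ]
    for reading in readings:
        print(to_text_line(reading), "REVIEW" if needs_review(reading) else "")
```

Adding a new device then means writing one more small adapter function, while the EHR-facing format stays unchanged.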
For more about devices, I dropped by one of the most entertaining parts of the conference, the Intelligent Hospital Pavilion. Here, after a badge scan, you are somberly led through a series of locked doors into simulated hospital rooms where you get to watch actors in nursing outfits work with lifesize dolls and check innumerable monitors. I think the information overload is barely ameliorated and may be worsened by the arrays of constantly updated screens.
But the background presentation is persuasive: by attaching RFIDs and all sorts of other devices to everything from people to equipment, and basically making the hospital more like a factory, providers can radically speed up responses in emergency situations and reduce errors. Some devices use the ISM "junk" band, whereas more critical ones use dedicated spectrum. Redundancy is built in throughout the background servers.
Waiting for the main event
The US health care field held its breath most of last week, waiting for Stage 2 meaningful use guidelines from HHS. The announcement never came, nor did it come this morning as many people had hoped. Because meaningful use is the major theme of HIMSS, and many sessions were planned on helping providers move to Stage 2, the delay in the announcement put the conference in an awkward position.
HIMSS is also nonplussed over a delay in another initiative, the adoption of a new standard in the classification of disease and procedures. ICD-10 is actually pretty old, having been standardized in the 1980s, and the U.S. lags decades behind other countries in adopting it. Advantages touted for ICD-10 are:
It incorporates newer discoveries in medicine than the dominant standard in the U.S., ICD-9, and therefore permits better disease tracking and treatment.
Additionally, it's much more detailed than ICD-9 (with an order of magnitude more classifications). This allows the recording of more information but complicates the job of classifying a patient correctly.
ICD-10 is rather controversial. Some people would prefer to base clinical decisions on SNOMED, a standard described in the Beyond Meaningful Use book mentioned earlier. Ultimately, doctors lobbied hard against the HHS timeline for adopting ICD-10 because providers are so busy with meaningful use. (But of course, the goals of adopting meaningful use are closely tied to the goals of adopting ICD-10.) It was the pushback from these institutions that led HHS to accede and announce a delay. HIMSS and many of its members were disap[…]
Data
Gov_2.0
americanrecoveryandreinvestmentact
arra
ehrs
electronichealthrecords
freesoftware
healthcare
healthit
himss
hitech
interoperability
meaningfuluse
medical
opensource
february 2012 by rahuldave
Book marketing is broken. Big data can fix it
february 2012 by rahuldave
Peter Collingridge (@gunzalis), cofounder of Enhanced Editions, says digital books require a new style of data-driven marketing and promotion that publishers aren't yet implementing. He also says that book marketing is broken and big data is the solution.
In the following interview, Collingridge talks about how real-time data and analytics can help publishers and he shares insights from the beta period of Bookseer, a market intelligence service for books his company is developing.
What are some key findings from the Bookseer beta?
Peter Collingridge: I think despite the increasing awareness of data as being a critical tool for publishers to compete, it's genuinely hard for people to look at data as a natural addition to the work they are doing, whether that's in PR, marketing, acquisition, or pricing.
Publishing has operated in a well-defined way for a long time, where experience and intuition have dominated decision making and change is hard. What has been really exciting is that when people have the data in front of them, clearly showing the immediate impact of something they did — a link between cause and effect that they couldn't see before — they get really excited. We've had people talking about being "obsessed" and "addicted" to the data.
Some of the most surprising findings: That on some titles, big price changes aren't as relevant to volume as everyone thinks; that big-name glowing reviews of literary fiction don't have anywhere near the impact on sales to merit the effort; and that social media buzz almost never translates into sales.
For me, the key observations so far are around marketing. First, big budget media spending and ostentatious banner ads might impress authors and bookshops, but they deliver very poor return on investment (ROI) for sales. Secondly, the super-smart publishers are behaving like startups and doing tiny little pieces of very focused and cheap marketing — and watching the results like hawks before iterating in direct response to the data. Bookseer is designed to disclose the former and to aid the latter — and that is probably our biggest finding: it works!
Find out more about Bookseer in the following video from the If Book Then conference earlier this year in Milan.
What kinds of data are most important for publishers to track?
Peter Collingridge: Before we built Bookseer, we spoke with 25 people across the industry, including authors big, small and unpublished; editors and publishers; managing directors; digital directors; sales, marketing and PR directors; and literary agents. We asked exactly that question.
For most people, the data they had was pretty basic: Nielsen (which obviously only goes to the granularity of one week) plus the F5 button to manically refresh an Amazon web page for changes in sales rank. Neither of these is particularly helpful in determining the impact of an activity.
Of course, there are loads of data points, but we began with the lowest-hanging fruit. Aggregated sales (print and digital) across multiple sources; Amazon sales rank; price; best-seller charts; social media mentions; buzz; review coverage in mainstream and new media, and on social reading sites; and other factors such as promotion (advertising and other) and merchandising.
We think the most important thing to do is aggregate activity and data points across as many sources as possible, building a picture of what's going on for one title or across a whole retailer, and allowing publishers to draw their own conclusions.
What does real-time data let publishers do?
Peter Collingridge: Publishing has been B2B, about supplying books into bookshops, for forever — combined with working with media to support that. And for that world, weekly aggregated retail sales work, I guess. But when you're in a much faster-paced world, with the industry moving toward being consumer- rather than trade-facing, and with a fragmented retail and media landscape, you need to make decisions based on fact: What is the ROI on a £50,000 marketing campaign? Where do my banner ads have the best CTR? Who are the key influencers here — are they bloggers, mainstream media, or somewhere else? How many of our Twitter followers actually engage? When should we publish, in what format, and at what price?
Data should absolutely inform the answers to these questions. Furthermore, with a disciplined approach to promotion, where activities are separated from each other by a day or a few hours, real-time measurement can identify what works and what doesn't. We can identify the difference between Al Gore tweeting about a book and Tim O'Reilly doing the same; the difference between a Time review and a piece on CNN; the impact of a price drop against an email sent to 200,000 subscribers; and measure the exact ROI on a £300 campaign against a £30,000 one.
Over time, you build up a picture of which tactics work best and which don't. And immediate feedback allows you to hone your activities in real-time to what works best (particularly if you are A/B testing different approaches), or from a more strategic perspective, to plan out campaigns that have historically worked best for comparable titles.
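The ROI comparison Collingridge describes comes down to simple arithmetic once a campaign can be isolated in time. The sketch below is not Bookseer's method: the function name, the baseline comparison, and every figure are hypothetical. It assumes that sales above the pre-campaign baseline during the measurement window are caused by the campaign, which only holds when activities are kept apart as he suggests.

```python
def campaign_roi(cost, baseline_units_per_day, observed_units_per_day,
                 days, margin_per_unit):
    """Return ROI as a fraction: (incremental margin - cost) / cost."""
    # Units sold above the baseline during the window are credited to the campaign.
    uplift_units = (observed_units_per_day - baseline_units_per_day) * days
    incremental_margin = uplift_units * margin_per_unit
    return (incremental_margin - cost) / cost


if __name__ == "__main__":
    # Hypothetical figures in pounds for a small and a large campaign.
    small = campaign_roi(300, 40, 100, days=3, margin_per_unit=2.5)
    large = campaign_roi(30000, 40, 1040, days=3, margin_per_unit=2.5)
    print(f"small campaign ROI: {small:.0%}")
    print(f"large campaign ROI: {large:.0%}")
```

With these made-up numbers the cheap campaign pays for itself and the expensive one does not, even though the expensive one sells far more books; that is the kind of result weekly aggregated sales figures would hide.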
How would you describe the relationship between sales and social media?
Peter Collingridge: Right now, sales drives social — not the other way round. However, I believe there will come a point when that's not the case, and we will be able to identify that.
This interview was edited and condensed.
Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.
Save 20% on registration with the code RADAR20
Related:
Publishers: What are they good for?
Book Publicity and Marketing: How Soon was Yesterday?
Three reasons why we're in a golden age of publishing entrepreneurship
Social data is an oracle waiting for a question
Data
Publishing
bigdata
bookmarketing
marketintelligence
publishinganalytics
publishingdata
from google
In the following interview, Collingridge talks about how real-time data and analytics can help publishers and he shares insights from the beta period of Bookseer, a market intelligence service for books his company is developing.
What are some key findings from the Bookseer beta?
Peter Collingridge: I think despite the increasing awareness of data as being a critical tool for publishers to compete, it's genuinely hard for people to look at data as a natural addition to the work they are doing, whether that's in PR, marketing, acquisition, or pricing.
Publishing has operated in a well-defined way for a long time, where experience and intuition have dominated decision making and change is hard. What has been really exciting is that when people have the data in front of them, clearly showing the immediate impact of something they did — a link between cause and effect that they couldn't see before — they get really excited. We've had people talking about being "obsessed" and "addicted" to the data.
Some of the most surprising findings: That on some titles, big price changes aren't as relevant to volume as everyone thinks; that big-name glowing reviews of literary fiction don't have anywhere near the impact on sales to merit the effort; and that social media buzz almost never translates into sales.
For me, the key observations so far are around marketing. First, big budget media spending and ostentatious banner ads might impress authors and bookshops, but they deliver very poor return on investment (ROI) for sales. Secondly, the super-smart publishers are behaving like startups and doing tiny little pieces of very focused and cheap marketing — and watching the results like hawks before iterating in direct response to the data. Bookseer is designed to disclose the former and to aid the latter — and that is probably our biggest finding: it works!
Find out more about Bookseer in the following video from the If Book Then conference earlier this year in Milan.
What kinds of data are most important for publishers to track?
Peter Collingridge: Before we built Bookseer, we spoke with 25 people across the industry, including authors big, small and unpublished; editors and publishers; managing directors; digital directors; sales, marketing and PR directors; and literary agents. We asked exactly that question.
For most people, the data they had was pretty basic: Nielsen (which obviously only goes to the granularity of one week) plus the F5 button to manically refresh an Amazon web page for changes in sales rank. Neither of these is particularly helpful in determining the impact of an activity.
Of course, there are loads of data points, but we began with the lowest-hanging fruit. Aggregated sales (print and digital) across multiple sources; Amazon sales rank; price; best-seller charts; social media mentions; buzz; review coverage in mainstream and new media, and on social reading sites; and other factors such as promotion (advertising and other) and merchandising.
We think the most important thing to do is aggregate activity and data points across as many sources as possible, building a picture of what's going on for one title or across a whole retailer, and allowing publishers to draw their own conclusions.
What does real-time data let publishers do?
Peter Collingridge: Publishing has been B2B, about supplying books into bookshops, for forever — combined with working with media to support that. And for that world, weekly aggregated retail sales work, I guess. But when you're in a much faster-paced world, with the industry moving toward being consumer- rather than trade-facing, and with a fragmented retail and media landscape, you need to make decisions based on fact: What is the ROI on a £50,000 marketing campaign? Where do my banner ads have the best CTR? Who are the key influencers here — are they bloggers, mainstream media, or somewhere else? How many of our Twitter followers actually engage? When should we publish, in what format, and at what price?
Data should absolutely inform the answers to these questions. Furthermore, with a disciplined approach to promotion, where activities are separated from each other by a day or a few hours, real-time measurement can identify what works and what doesn't. We can identify the difference between Al Gore tweeting about a book and Tim O'Reilly doing the same; the difference between a Time review and a piece on CNN; the impact of a price drop against an email sent to 200,000 subscribers; and measure the exact ROI on a £300 campaign against a £30,000 one.
Over time, you build up a picture of which tactics work best and which don't. And immediate feedback allows you to hone your activities in real-time to what works best (particularly if you are A/B testing different approaches), or from a more strategic perspective, to plan out campaigns that have historically worked best for comparable titles.
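The ROI comparison described above comes down to simple arithmetic once the sales uplift attributable to a campaign has been measured. The following is a minimal sketch of that calculation, not Bookseer's method: the `Campaign` fields, the baseline figures and every number below are hypothetical, chosen only to show how a £300 campaign and a £30,000 one could be put on the same scale.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    name: str
    cost: float             # spend in pounds
    baseline_units: float   # units expected without the campaign (assumed known)
    observed_units: float   # units actually sold in the measurement window
    margin_per_unit: float  # profit per unit in pounds


def roi(c: Campaign) -> float:
    """Return (incremental profit - cost) / cost for one campaign."""
    incremental_profit = (c.observed_units - c.baseline_units) * c.margin_per_unit
    return (incremental_profit - c.cost) / c.cost


# Hypothetical figures, invented purely for illustration.
small = Campaign("targeted email", 300.0, 100, 400, 2.0)
large = Campaign("banner ads", 30_000.0, 100, 9_100, 2.0)

for campaign in (small, large):
    print(f"{campaign.name}: ROI {roi(campaign):.0%}")
```

With these made-up inputs the small campaign returns 100% and the large one loses 40%. The hard part in practice is the baseline, which is why separating activities by a day or a few hours matters: it makes the uplift from each one easier to isolate.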
How would you describe the relationship between sales and social media?
Peter Collingridge: Right now, sales drive social, not the other way round. However, I believe there will come a point when that's not the case, and we will be able to identify that.
This interview was edited and condensed.
Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.
Save 20% on registration with the code RADAR20
Related:
Publishers: What are they good for?
Book Publicity and Marketing: How Soon was Yesterday?
Three reasons why we're in a golden age of publishing entrepreneurship
Social data is an oracle waiting for a question
february 2012 by rahuldave
Why data visualization matters
february 2012 by rahuldave
Let's say you need to understand thousands or even millions of rows
of data, and you have a short time to do it in. The data may come
from your team, in which case perhaps you're already familiar
with what it's measuring and what the results are likely to be. Or it
may come from another team, or maybe several teams at
once, and be completely unfamiliar. Either way, the reason you're
looking at it is that you have a decision to make, and you want to be
informed by the data before making it. Something probably hangs in
the balance: a customer, a product, or a profit.
How are you going to make sense of all that information efficiently
so you can make a good decision? Data visualization is an important
answer to that question.
However, not all visualizations are actually that helpful. You may
be all too familiar with lifeless bar graphs, or line graphs made with
software defaults and couched in a slideshow presentation or lengthy
document. They can be at best confusing, and at worst misleading. But
the good ones are an absolute revelation.
The best data visualizations are ones that expose something
new about the underlying patterns and relationships contained
within the data. Understanding those relationships — and being
able to observe them — is key to good decision making. The
Periodic Table is a classic testament to the potential of
visualization to reveal hidden relationships in even small datasets.
One look at the table, and chemists and middle school students alike
grasp the way atoms arrange themselves in groups: alkali metals, noble
gases, halogens.
If visualization done right can reveal so much in even a small dataset like this, imagine what it can reveal within terabytes or petabytes of information.
Types of visualization
It's important to point out that not all data visualization is
created equal. Just as we have paints and pencils and chalk and film
to help us capture the world in different ways, with different
emphases and for different purposes, there are multiple ways in which
to depict the same dataset.
Or, to put it another way, think of visualization as a new set of
languages you can use to communicate. Just as French and Russian and
Japanese are all ways of encoding ideas so that those ideas can be
transported from one person's mind to another, and decoded
again — and just as certain languages are more conducive to
certain ideas — so the various kinds of data visualization are a
kind of bidirectional encoding that lets ideas and
information be transported from the database into your brain.
Explaining and exploring
An important distinction lies between visualization for
exploring and visualization for explaining. A third
category, visual art, comprises images that encode data but
cannot easily be decoded back to the original meaning by a
viewer. This kind of visualization can be beautiful, but it is not
helpful in making decisions.
Visualization for exploring can be imprecise. It's useful when
you're not exactly sure what the data has to tell you and you're
trying to get a sense of the relationships and patterns contained
within it for the first time. It may take a while to figure out how
to approach or clean the data, and which dimensions to include.
Therefore, visualization for exploring is best done in such a way that
it can be iterated quickly and experimented upon, so that you can find
the signal within the noise. Software and automation are your friends
here.
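As a rough illustration of how cheap an exploratory view can be, here is a minimal sketch that bins a list of integers and prints a text histogram using only the Python standard library. The function name `text_histogram` and the sample values are hypothetical; a real exploration would more likely use a plotting library, but the point is the same: a quick, imprecise picture you can rerun with a different bin width in seconds.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one text line per occupied bin of integer values."""
    # Map each value to the lower edge of its bin, then count per bin.
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(counts):
        end = start + bin_width - 1
        lines.append(f"{start:>4}-{end:<4} | {'#' * counts[start]}")
    return lines


# Hypothetical sample data, invented for illustration.
sample = [3, 7, 12, 15, 18, 21, 22, 25, 27, 29, 41]
for line in text_histogram(sample, 10):
    print(line)
```

Changing `bin_width` from 10 to 5 and rerunning is the kind of fast iteration that exploring calls for; empty bins are simply skipped in this sketch.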
Visualization for explaining is best when it is cleanest. Here, the
ability to pare down the information to its simplest form — to
strip away the noise entirely — will increase the efficiency with
which a decision maker can understand it. This is the approach to
take once you understand what the data is telling you, and you want to
communicate that to someone else. This is the kind of visualization
you should be finding in those presentations and sales
reports.
Visualization for explaining also includes infographics and other
categories of hand-drawn or custom-made images. Automated tools can be
used, but one size does not fit all.
Your customers make decisions, too
While data visualization is a powerful tool for helping you and
others within your organization make better decisions, it's important
to remember that, in the meantime, your customers are trying to decide
between you and your competitors. Many kinds of data visualization,
from complex interactive or animated graphs to brightly-colored
infographics, can help your customers explore and your customer service folks explain.
That's why all kinds of companies and organizations, from GE to Trulia to NASA, are beginning to invest
significant resources in providing interactive visualizations to their
customers and the public. This allows viewers to better understand the
company's business, and interact in a self-directed manner with the
company's expertise.
As big data becomes bigger, and more companies deal with complex
datasets with dozens of variables, data visualization will become
even more important. So far, the tide of popularity has risen more
quickly than the tide of visual literacy, and mediocre efforts abound,
in presentations and on the web.
But as visual literacy rises, thanks in no small part to impressive
efforts in major media such as The New York Times and The
Guardian, data visualization will increasingly become a language
your customers and collaborators expect you to speak — and speak
well.
Do yourself a favor and hire a designer
It's well worth investing in a talented in-house
designer, or a team of designers. Visualization for explaining works
best when someone who understands not only the data itself, but also
the principles of design and visual communication, tailors the graph or
chart to the message.
Whether it's text or visuals, important translations require more than basic tools.
To go back to the language analogy: Google Translate is a powerful
and useful tool for giving you the general idea of what a foreign text
says. But it's not perfect, and it often lacks nuance. For getting
the overall gist of things, it's great. But I wouldn't use it to send
a letter to a foreign ambassador. For something so sensitive, and
where precision counts, it's worth hiring an experienced human
translator.
Since data visualization is like a foreign language, in the same
way, hire an experienced designer for important jobs where
precision matters. If you're making the kinds of decisions in which
your customer, product, or profit hangs in the balance, you can't
afford to base those decisions on incomplete or misleading
representations of the knowledge your company holds.
Your designer is your translator, and one of the most important
links you and your customers have to your data.
Related:
Beautiful Visualization: Looking at Data through the Eyes of Experts (book)
When judging visualizations, intent matters
Redesigning the New York City subway map
Visualization Deconstructed series
Visualization of the Week series
Data
businessintelligence
datadecisions
datapredictions
dataproduct
datatools
planningforbigdata
visualization
february 2012 by rahuldave
The bond between data and journalism grows stronger
february 2012 by rahuldave
While reporters and editors have been the traditional vectors for information gathering and dissemination, the flattened information environment of 2012 now has news breaking first online, not on the newsdesk.
That doesn't mean that the integrated media organizations of today don't play a crucial role. Far from it. In the information age, journalists are needed more than ever to curate, verify, analyze and synthesize the wash of data.
To learn more about the shifting world of data journalism, I interviewed Liliana Bounegru (@bb_liliana), project coordinator of SYNC3, the first international Data Journalism Awards, and Data Driven Journalism at the European Journalism Centre.
What's the difference between the data journalism of today and the computer-assisted reporting (CAR) of the past?
Liliana Bounegru: There is a "continuity and change" debate going on around the label "data journalism" and its relationship with previous journalistic practices that employ computational techniques to analyze datasets.
Some argue [PDF] that there is a difference between CAR and data journalism. They say that CAR is a technique for gathering and analyzing data as a way of enhancing (usually investigative) reportage, whereas data journalism pays attention to the way that data sits within the whole journalistic workflow. In this sense, data journalism pays equal attention to finding stories and to the data itself. Hence, we find the Guardian Datablog or the Texas Tribune publishing datasets alongside stories, or even just datasets by themselves for people to analyze and explore.
Another difference is that in the past, investigative reporters
would suffer from a poverty of information relating to a question they
were trying to answer or an issue that they were trying to address.
While this is, of course, still the case, there is also an overwhelming
abundance of information that journalists don't necessarily know what
to do with. They don't know how to get value out of data. As Philip Meyer recently wrote to
me: "When information was scarce, most of our efforts were devoted to
hunting and gathering. Now that information is abundant, processing is
more important."
On the other hand, some argue that there is no difference between
data journalism and computer-assisted reporting. It is by now common
sense that even the most recent media practices have histories as
well as something new in them. Rather than debating whether or not
data journalism is completely novel, a more fruitful position would be
to consider it as part of a longer tradition but responding to new
circumstances and conditions. Even if there might not be a difference
in goals and techniques, the emergence of the label "data journalism"
at the beginning of the century indicates a new phase wherein the
sheer volume of data that is freely available online combined with
sophisticated user-centric tools enables more people to work with more
data more easily than ever before. Data journalism is about mass data
literacy.
What does data journalism mean for the future of journalism? Are
there new business models here?
Liliana Bounegru: There are all kinds of
interesting new business models emerging with data journalism. Media
companies are becoming increasingly innovative with the way they
produce revenues, moving away from subscription-based models and
advertising to offering consultancy services, as in the case of the
German award-winning OpenDataCity.
Digital technologies and the web are fundamentally changing the way
we do journalism. Data journalism is one part in the ecosystem of
tools and practices that have sprung up around data sites and
services. Quoting and sharing source materials (structured data) is in
the nature of the hyperlink structure of the web and in the way we are
accustomed to navigating information today. By enabling anyone to
drill down into data sources and find information that is relevant to
them as individuals or to their community, as well as to do fact
checking, data journalism provides a much needed service coming from a
trustworthy source. Quoting and linking to data sources is specific
to data journalism at the moment, but seamless integration of data in
the fabric of media is increasingly the direction journalism is going
in the future. As Tim Berners-Lee says, "data-driven journalism is the future."
What data-driven journalism initiatives have caught your attention?
Liliana Bounegru: The data journalism project FarmSubsidy.org is one of my
favorites. It addresses a real problem: The European Union (EU) is
spending 48% of its budget on agriculture subsidies, yet the money
doesn't reach those who need it.
Tracking payments and recipients of agriculture subsidies from the
European Union to all member states is a difficult task. The data is
scattered in different places in different formats, with some missing
and some scanned in from paper records. It is hard to piece it
together to form a comprehensive picture of how funds are distributed.
The project not only made the data available to anyone in an
easy-to-understand way, but it also advocated for policy changes and
better transparency laws.
Another of my favorite examples is the LRA Crisis Tracker, a
real-time crisis mapping platform and data collection system. The
tracker makes information about the attacks and movements of the Lord's
Resistance Army (LRA) in Africa publicly available. It helps to inform
local communities, as well as the organizations that support
the affected communities, about the activities of the LRA through an
early-warning radio network in order to reduce their response time to
incidents.
I am also a big fan of much of the work done by the Guardian Datablog.
You can find lots of other examples featured on datadrivenjournalism.net,
along with interviews, case studies and tutorials.
I've talked to people like Chicago Tribune news app developer
Brian Boyer about the emerging "newsroom
stack." What do you feel are the key tools of the data
journalist?
Liliana Bounegru: Experienced data journalists
list spreadsheets as a top data journalism tool. Open source tools and
web-based applications for data cleaning, analysis and visualization
play very important roles in finding and presenting data stories. I
have been involved in organizing several workshops on ScraperWiki and Google Refine for
data collection and analysis. We found that participants were quite
able to quickly ask and answer new kinds of questions with these
tools.
How does data journalism relate to open data and open government?
Liliana Bounegru: Open government data means that
more people can access and reuse official information published by
government bodies. This in itself is not enough. It is increasingly
important that journalists can keep up and are equipped with skills
and resources to understand open government data. Journalists need to
know what official data means, what it says and what it leaves out.
They need to know what kind of picture is being presented of an
issue.
Public bodies are very experienced in presenting data to the public
in support of official policies and practices. Journalists, however,
will often not have this level of literacy. Only by equipping
journalists with the skills to use data more effectively can we break
the current asymmetry, where our understanding of the information that
matters is mediated by governments, companies and other experts. In a
nutshell, open data advocates push for more data, and data journalists
help the public to use, explore and evaluate it.
This interview has been edited and condensed for clarity.
Related:
International Data Journalism Awards Recognize a Crucial Field
The growing importance of data journalism
Data journalism and the emerging newsroom stack
The work of data journalism: Find, clean, analyze, create ... repeat
Data
Gov_2.0
Publishing
bigdata
datajournalism
dataprojects
datascience
governmentit
opendata
opengovernmentdata
Public bodies are very experienced in presenting data to the public
in support of official policies and practices. Journalists, however,
will often not have this level of literacy. Only by equipping
journalists with the skills to use data more effectively can we break
the current asymmetry, where our understanding of the information that
matters is mediated by governments, companies and other experts. In a
nutshell, open data advocates push for more data, and data journalists
help the public to use, explore and evaluate it.
This interview has been edited and condensed for clarity.
Photo on associated home and category pages: NYTimes: 365/360 - 1984 (in color) by blprnt_van, on Flickr.
Related:
International Data Journalism Awards Recognize a Crucial Field
The growing importance of data journalism
Data journalism and the emerging newsroom stack
The work of data journalism: Find, clean, analyze, create ... repeat
february 2012 by rahuldave
How to create a visualization
february 2012 by rahuldave
Over the last few years I've created a few popular visualizations and a lot of duds, and I've learned a few lessons along the way. For my latest analysis of where Facebook users go on vacation, I decided to document the steps I follow to build my visualizations. It's a very rough guide; these are just stages I've learned to follow by trial and error, but following these guidelines is a good way to start if you're looking to create your first visualization.
Play with your data
I was lucky enough to spend a few hours with Andreas Weigend recently, head of the Stanford Social Data lab. He has nine rules of data, and the first is "Start with the problem, not the data." What struck me about visualizations is that I actually take the opposite approach. I find the only way to begin is to explore what information is available and get a feeling for what stories it can tell.
In my case, we have a Cassandra cluster with information on more than 350 million photos shared on Facebook. I've been running Pig analytics jobs regularly to get a view of what we have in there. One of the reports we generate is a count of how many photos and users we have for particular places:
I was chatting with my colleague Chris Raynor about this, and he asked me if we could tell where all the visitors to those places were coming from. This was something that had been at the back of my mind for a long time. Seeing how much information we had on each destination made me realize we had enough data to produce significant and meaningful answers.
When I was learning engineering, one of my favorite case studies was an investigation into an air-traffic control system. Software engineers couldn't understand why fully-computerized control rooms were actually less efficient and safe than more old-fashioned sites. What the researchers discovered was that the old process of passing around and arranging small cards that each represented a plane gave controllers a much stronger awareness of the situation than a screen that didn't require their involvement for tasks, such as handing an aircraft to a colleague. I think the same is true of data. The more time you spend manipulating and examining the raw information, the more you understand it at a deep level. Knowing your data is the essential starting point for any visualization.
Pick a question
Now that I had a rough idea of what I wanted to visualize, I really needed to focus on what I would be doing. The best way to do that is to choose the exact title you want to give your visualization. I actually messed this up on one early map I created, giving the blog post the title "How to split up the US." Everyone subsequently described it as "The Five Nations of Facebook." Since then, I've tried very hard to pick the most natural title for what I'm going to be presenting, and then ensure I can deliver on the promise of the headline.
In this case I had a clear idea of the question at the start: it was going to be "Where do people go on vacation?". However, as I thought about it, I realized it needed to be a lot more specific and concrete. There are already a lot of "top travel destinations" lists out there, so what made mine different? It was the use of Facebook to gather much richer and more detailed information, so I refined it to "Where do Facebook users go on vacation?".
Sketch out your presentation
I now had the data and a question I wanted to answer. The next step was figuring out how to show the information in a visual form. I'm in love with network diagrams showing connections between thousands of objects, but so often they are completely baffling to the rest of the world. I still remember David Cohen threatening to strangle me if I showed him another one of "those damn spider webs" instead of a business plan. However, network diagrams are a good way of hinting at how much data is available for querying; they can really give an idea of the sheer scale of what's there.
One of my favorite recent visualizations was Paul Butler's map of friendships on Facebook, so I decided to use that as a visual reference:
See the full version of Paul Butler's "Visualizing Friendships" visualization.
I borrowed a couple of key ideas from his work: the general color palette of the blue lines on a dark background and the use of great circles to create flowing arcs for all connections.
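To make the great-circle idea concrete, here is a minimal Python sketch of how points along such an arc could be computed before drawing. It is not the renderer used for this visualization (that was custom JavaScript); the function name `great_circle_points` and the `steps` parameter are illustrative assumptions. It interpolates between the two endpoints on a unit sphere and converts each intermediate point back to latitude and longitude.

```python
import math

def great_circle_points(lat1, lon1, lat2, lon2, steps=32):
    """Return (lat, lon) pairs in degrees along the great circle between two points."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    # Represent both endpoints as unit vectors on the sphere.
    a = (math.cos(p1) * math.cos(l1), math.cos(p1) * math.sin(l1), math.sin(p1))
    b = (math.cos(p2) * math.cos(l2), math.cos(p2) * math.sin(l2), math.sin(p2))
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two endpoints
    sin_omega = math.sin(omega)
    if sin_omega < 1e-12:
        # Identical or antipodal points have no unique arc; return the endpoints only.
        return [(lat1, lon1), (lat2, lon2)]
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Spherical linear interpolation weights for the two endpoints.
        wa = math.sin((1 - t) * omega) / sin_omega
        wb = math.sin(t * omega) / sin_omega
        x, y, z = (wa * a[k] + wb * b[k] for k in range(3))
        lat = math.degrees(math.atan2(z, math.hypot(x, y)))
        lon = math.degrees(math.atan2(y, x))
        points.append((lat, lon))
    return points

# Three points along the equator from longitude 0 to longitude 90.
arc = great_circle_points(0.0, 0.0, 0.0, 90.0, steps=2)
```

Each returned point would then be projected to screen coordinates and joined with line segments; more steps give a smoother curve.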
As I thought about the presentation, I realized that I had to simplify what it would be showing. With sources and destinations plotted all over the world, both the visual look and the querying interface would be overwhelming. Our user-base is primarily American thanks to our reliance on English-only natural language processing, so with that in mind I decided to make life simpler by only showing data from people who lived in the U.S. Accordingly, I changed the question in my title to "Where do American Facebook users go on vacation?".
While I'm mostly presenting this as a linear, waterfall process, what I've just described is a good example of how iterative cycles drive the real workflow. It's hard to know how well a lot of things will work until you try them. As long as you're still making some progress, don't worry if you find yourself going in circles.
Crunch the data
If you know your data, and you have a good idea of the question you're trying to answer, this should be the simplest stage. You'll hopefully have a clear set of requirements and it's just a matter of executing the right queries over your data.
In this case I already had some Pig scripts asking similar questions, so I was able to adapt one of those. The biggest surprise was when I ran into issues with some of the joins. The hard part was running the Hadoop job to gather the raw data from our Cassandra cluster, and that worked. I was able to output smaller files containing the gathered data, and then run a local Pig job to do the joins I needed.
The next stage was turning the raw information into a form that could be displayed. For example, I needed to take all of the user locations from the unstructured text strings that Facebook gave me, and convert them into latitude-longitude coordinates for plotting on a map. For this sort of work I usually turn to a general-purpose scripting language, and most of Jetpac is already written in Ruby, so that was an easy choice. I wrote a script that walked through the data, using the Data Science Toolkit to match coordinates with names, and then output it into a file containing a JSON array of all the information.
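The shape of that conversion step can be sketched in a few lines. The original script was written in Ruby and used the Data Science Toolkit; the Python version below is only an illustration, and the `GAZETTEER` lookup table, the `geocode` helper, and the record field names are all hypothetical stand-ins rather than the real service or data format.

```python
import json

# Hypothetical lookup table standing in for a real geocoding service.
GAZETTEER = {
    "san francisco, california": (37.7749, -122.4194),
    "boulder, colorado": (40.0150, -105.2705),
}

def geocode(name):
    """Return (lat, lon) for a free-text place name, or None if it is unknown."""
    return GAZETTEER.get(name.strip().lower())

def build_records(rows):
    """Turn (home, destination, visitors) rows into plottable records."""
    records = []
    for home, destination, visitors in rows:
        coords = geocode(home)
        if coords is None:
            continue  # skip locations that cannot be matched to coordinates
        records.append({
            "home": home.strip(),
            "lat": coords[0],
            "lon": coords[1],
            "destination": destination,
            "visitors": visitors,
        })
    return records

rows = [
    (" San Francisco, California", "Paris", 120),
    ("Atlantis", "Paris", 3),  # unmatched, so it is dropped
]
records = build_records(rows)
output = json.dumps(records)  # a JSON array ready to write to a file
```

Dropping unmatched names silently is a simplification; a real pipeline would probably log them so the coverage of the geocoder can be checked.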
Build an interface
A lot of the best visualizations have no interactivity. They just tell a story with a static image. That's why it's worth considering whether you need an interface at all. I actually had the interactive site that I used to create the "Five Nations of Facebook" visualization up for several weeks before that post, and nobody used it because it was too confusing. It was only when I boiled it down into a single picture with labels that it became a hit.
My problem is that I want other people to have as much fun exploring the data as I've had, so I couldn't resist adding some interaction to the vacation visualization. I still wanted to retain the immediate visual appeal of a static image, so I decided to create a background showing the full data to introduce the visualization at a first glance, and then overlay an interactive foreground once the user started exploring it more deeply.
In most cases you're better off using one of the excellent off-the-shelf visualization frameworks like D3. Since I needed something client-side for interaction, and was working with both geographic and network rendering, I couldn't find anything that met my requirements. Instead I cannibalized one of my own projects, the jQuery component from OpenHeatMap, and combined it with HTML5 canvas rendering to produce a custom JavaScript renderer. I used it to pre-render a background containing all the possible connections between home towns and travel destinations, and saved that off as a static image. That's useful to save rendering time on page load, and lets me fall back to a static visualization on older browsers that don't support Canvas.
I then tied in rendering the connections of any places that the user was hovering their cursor over, so that they could quickly get a feel for the relationships expressed in the data. I also wanted to display the details underlying the picture, so to drill down I added a dialog listing the raw statistics about a place. Users can bring this dialog up by clicking.
One problem with that interaction is that a lot of different cities are in a very small area, so it becomes extremely difficult to pick the one you want with the mouse cursor. To make that a little better, I prioritized the most popular U.S. cities so that in case of a conflict, they're chosen over their smaller neighbors. I realized I also needed to add a search box. Thankfully we're heavy users of Twitter's Bootstrap framework, so it was a simple matter to add a search field and tie it in with Twitter's excellent autocomplete component.
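The conflict rule described above can be sketched as a small hit test. This is a Python illustration of the idea, not the JavaScript used on the site; the `pick_city` function, the pixel `radius`, and the `popularity` field are assumptions made for the example.

```python
import math

def pick_city(cities, cursor_x, cursor_y, radius=10.0):
    """Return the most popular city within `radius` pixels of the cursor, or None."""
    candidates = [
        c for c in cities
        if math.hypot(c["x"] - cursor_x, c["y"] - cursor_y) <= radius
    ]
    if not candidates:
        return None
    # When several cities overlap, prefer the most popular one over its neighbors.
    return max(candidates, key=lambda c: c["popularity"])

cities = [
    {"name": "Los Angeles", "x": 100.0, "y": 100.0, "popularity": 500},
    {"name": "West Hollywood", "x": 103.0, "y": 101.0, "popularity": 50},
    {"name": "Denver", "x": 300.0, "y": 80.0, "popularity": 200},
]
hit = pick_city(cities, 103.0, 101.0)
```

Here the cursor sits exactly on the smaller city, yet the larger neighbor wins, which is why a search box is still needed to reach the less popular places.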
Find the surprises!
I build these visualizations so I can explore them myself, so my favorite part of the whole process is the chance to sit and play with the results. There are always unexpected stories hidden in there, and I love uncovering them. For example, who knew that the city that had the most visitors to Paris was West Hollywood? When I lived in Los Angeles I used to love popping by the wonderful patisseries. Now I know why they're so good! These little details are the stories that catch people's imagination and cause them to spread the word, so think about writing a few of them up to help visitors understand what the page can tell them.
You'll never know whether one of your visualizations will become popular ahead of time, but the real reward is enjoying your own work. I hope this short guide gives you some ideas f[…]
Data
facebook
visualization
visualizationprocess
from google
The NoSQL movement
february 2012 by rahuldave
In a conversation last year, Justin Sheehy, CTO of Basho, described NoSQL as a movement,
rather than a technology. This description immediately felt right;
I've never been comfortable talking about NoSQL, which, when taken
literally, extends from the minimalist Berkeley DB (commercialized
as Sleepycat, now owned by Oracle) to the big iron HBase, with
detours into software as fundamentally different as Neo4J (a graph
database) and FluidDB (which defies description).
But what does it mean to say that NoSQL is a movement rather
than a technology? We certainly don't see picketers outside
Oracle's headquarters. Justin said succinctly that NoSQL is a
movement for choice in database architecture. There is no single
overarching technical theme; a single technology would belie the
principles of the movement.
Think of the last 15 years of software development. We've gotten
very good at building large, database-backed applications. Many of
them are web applications, but even more of them aren't. "Software
architect" is a valid job description; it's a position to which
many aspire. But what do software architects do? They specify the
high-level design of applications: the front end, the APIs, the
middleware, the business logic — the back end? Well, maybe not.
Since the '80s, the dominant back end of business systems has been a
relational database, whether Oracle, SQL Server or DB2. That's not
much of an architectural choice. Those are all great products, but
they're essentially similar, as are all the other relational
databases. And it's remarkable that we've explored many architectural
variations in the design of clients, front ends, and middleware, on a
multitude of platforms and frameworks, but haven't until recently
questioned the architecture of the back end. Relational databases have
been a given.
Many things have changed since the advent of relational
databases:
We're dealing with much more data. Although advances in storage
capacity and CPU speed have allowed the databases to keep pace,
we're in a new era where size itself is an important part of the
problem, and any significant database needs to be distributed.
We require sub-second responses to queries. In the '80s, most
database queries could run overnight as batch jobs. That's no
longer acceptable. While some analytic functions can still run as
overnight batch jobs, we've seen the web evolve from static files
to complex database-backed sites, and that requires sub-second
response times for most queries.
We want applications to be up 24/7. Setting up redundant
servers for static HTML files is easy, but database replication
in a complex database-backed application is another matter.
We're seeing many applications in which the database has to
soak up data as fast as (or even much faster than) it processes
queries: in a logging application, or a distributed sensor
application, writes can be much more frequent than reads.
Batch-oriented ETL (extract, transform, and load) hasn't
disappeared, and won't, but capturing high-speed data flows is
increasingly important.
We're frequently dealing with changing data or with
unstructured data. The data we collect, and how we use it, grows
over time in unpredictable ways. Unstructured data isn't a
particularly new feature of the data landscape, since unstructured
data has always existed, but we're increasingly unwilling to force
a structure on data a priori.
We're willing to sacrifice our sacred cows. We know that
consistency and isolation and other properties are very valuable,
of course. But so are some other things, like latency and
availability and not losing data even if our primary server goes
down. The challenges of modern applications make us realize that
sometimes we might need to weaken one of these constraints in order
to achieve another.
These changing requirements lead us to different tradeoffs and
compromises when designing software. They require us to rethink
what we require of a database, and to come up with answers aside
from the relational databases that have served us well over the
years. So let's look at these requirements in somewhat more
detail.
Size, response, availability
It's a given that any modern application is going to be
distributed. The size of modern datasets is only one reason for
distribution, and not the most important. Modern applications
(particularly web applications) have many concurrent users who
demand reasonably snappy response. In their 2009 Velocity Conference talk, Performance Related
Changes and their User Impact, Eric Schurman and Jake Brutlag
showed results from independent research projects at Google and
Microsoft. Both projects demonstrated that imperceptibly small increases
in response time cause users to move to another site; if response
time is over a second, you're losing a very measurable percentage
of your traffic.
If you're not building a web application — say you're doing
business analytics, with complex, time-consuming queries — the
world has changed, and users now expect business analytics to run
in something like real time. Maybe not the sub-second latency
required for web users, but queries that run overnight are no
longer acceptable. Queries that run while you go out for coffee are
marginal. It's not just a matter of convenience; the ability to run
dozens or hundreds of queries per day changes the nature of the
work you do. You can be more experimental: you can follow through
on hunches and hints based on earlier queries. That kind of
spontaneity was impossible when research went through the DBA at
the data warehouse.
Whether you're building a customer-facing application or doing
internal analytics, scalability is a big issue. Vertical
scalability (buy a bigger, faster machine) always runs into limits.
Now that the laws of physics have stalled Intel-architecture clock
speeds in the 3.5GHz range, those limits are more apparent than
ever. Horizontal scalability (build a distributed system with more
nodes) is the only way to scale indefinitely. You're scaling
horizontally even if you're only buying single boxes: it's been a
long time since I've seen a server (or even a high-end desktop)
that doesn't sport at least four cores. Horizontal scalability is
tougher when you're scaling across racks of servers at a colocation
facility, but don't be deceived: that's how scalability works in
the 21st century, even on your laptop. Even in your cell phone. We
need database technologies that aren't just fast on single servers:
they must also scale across multiple servers.
Modern applications also need to be highly available. That goes
without saying, but think about how the meaning of "availability"
has changed over the years. Not much more than a decade ago, a web
application would have a single HTTP server that handed out static
files. These applications might be data-driven; but "data driven"
meant that a batch job rebuilt the web site overnight, and user
transactions were queued into a batch processing system, again for
processing overnight. Keeping such a system running isn't terribly
difficult. High availability doesn't impact the database. If the
database is only engaged in batched rebuilds or transaction
processing, the database can crash without damage. That's the world
for which relational databases were designed. In the '80s, if your
mainframe ran out of steam, you got a bigger one. If it crashed,
you were down. But when databases became a living, breathing part
of the application, availability became an issue. There is no way
to make a single system highly available; as soon as any component
fails, you're toast. Highly available systems are, by nature,
distributed systems.
If a distributed database is a given, the next question is how
much work a distributed system will require. There are
fundamentally two options: databases that have to be distributed
manually, via sharding; and databases that are inherently
distributed. Relational databases are split between multiple hosts
by manual sharding, or determining how to partition the datasets
based on some properties of the data itself: for example, first
names starting with A-K on one server, L-Z on another. A lot of
thought goes into designing a sharding and replication strategy
that doesn't impair performance, while keeping the data relatively
balanced between servers. There's a third option that is
essentially a hybrid: databases that are not inherently
distributed, but that are designed so they can be partitioned
easily. MongoDB is an example
of a database that can be sharded easily (or even automatically);
HBase, Riak, and Cassandra are all inherently
distributed, with options to control how replication and
distribution work.
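The manual sharding example above (first names A-K on one server, L-Z on another) can be sketched in Python to show why balance is a design concern. The server names and both function names are hypothetical; this is an illustration of range-based partitioning, not how any particular database implements it.

```python
from collections import Counter

def range_shard(first_name):
    """Route a record by the first letter of a first name: A-K vs. everything else."""
    initial = first_name.strip()[:1].upper()
    # Names that are empty or start with a non-letter fall through to the second server.
    return "server-1" if "A" <= initial <= "K" else "server-2"

def shard_balance(names):
    """Count how many records each server would hold under this scheme."""
    return Counter(range_shard(name) for name in names)

names = ["Alice", "Bob", "Kim", "Liam"]
balance = shard_balance(names)
```

With this tiny sample, three of the four records land on one server. Real name distributions are skewed in the same way, which is why the partitioning rule has to be chosen with the data in mind.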
What database choices are viable when you need good interactive
response? There are two separate issues: read latency and write
latency. For reasonably simple queries on a database with
well-designed indexes, almost any modern database can give decent
read latency, even at reasonably large scale. Similarly, just about
all modern databases claim to be able to keep up with writes at
high speed. Most of these databases, including HBase, Cassandra,
Riak, and CouchDB, write
data immediately to an append-only file, which is an extremely
efficient operation. As a result, writes are often significantly
faster than reads.
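A toy Python sketch can show why appending is cheap. It is a deliberately simplified illustration and not the storage engine of HBase, Cassandra, Riak, or CouchDB: the `AppendOnlyLog` class and its in-memory index are assumptions made for the example. Every write goes to the end of one file, and an update never rewrites old data; it just appends a new value and repoints the index.

```python
import os
import tempfile

class AppendOnlyLog:
    """Toy key-value store: values are appended to a file, never overwritten."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> (offset, length) of the latest value
        self.file = open(path, "ab")

    def put(self, key, value):
        data = value.encode("utf-8")
        self.file.seek(0, os.SEEK_END)
        offset = self.file.tell()
        self.file.write(data)  # sequential write at the end of the file
        self.file.flush()
        self.index[key] = (offset, len(data))

    def get(self, key):
        offset, length = self.index[key]
        with open(self.path, "rb") as reader:
            reader.seek(offset)  # a read needs a seek to wherever the value lives
            return reader.read(length).decode("utf-8")

    def close(self):
        self.file.close()

path = os.path.join(tempfile.mkdtemp(), "data.log")
log = AppendOnlyLog(path)
log.put("user:1", "alice")
log.put("user:1", "alicia")  # the old value stays in the file; only the index moves
latest = log.get("user:1")
```

The stale value still occupies space in the file, so a real system would need some form of periodic cleanup, which this sketch omits.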
Whether any particular database can deliver the performance you
need depends on the nature of the application, and whether you've
designed the application in a way that uses the database
efficiently: in particular, the structure of queries, more than the
structure[…]
Data
databases
nonrelationaldatabase
nosql
planningforbigdata
from google
february 2012 by rahuldave
What is Apache Hadoop?
february 2012 by rahuldave
Apache Hadoop has been
the driving force behind the growth of the big data industry. You'll
hear it mentioned often, along with associated technologies such as
Hive and Pig. But what does it do, and why do you need all its
strangely named friends, such as Oozie, Zookeeper and Flume?
Hadoop brings the ability to cheaply process large amounts of
data, regardless of its structure. By large, we mean from 10-100
gigabytes and above. How is this different from what went before?
Existing enterprise data warehouses and relational databases excel
at processing structured data and can store massive amounts of
data, though at a cost: the requirement for structure restricts the kinds of
data that can be processed, and it imposes an inertia that makes
data warehouses unsuited for agile exploration of massive
heterogeneous data. The amount of effort required to warehouse data
often means that valuable data sources in organizations are never
mined. This is where Hadoop can make a big difference.
This article examines the components of the Hadoop ecosystem and
explains the functions of each.
The core of Hadoop: MapReduce
Created at
Google in response to the problem of creating web search
indexes, the MapReduce framework is the powerhouse behind most of
today's big data processing. In addition to Hadoop, you'll find
MapReduce inside MPP and NoSQL databases, such as Vertica or MongoDB.
The important innovation of MapReduce is the ability to take a query
over a dataset, divide it, and run it in parallel over multiple
nodes. Distributing the computation solves the issue of data too large to fit
onto a single machine. Combine this technique with commodity Linux
servers and you have a cost-effective alternative to massive
computing arrays.
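The divide-and-run-in-parallel idea can be sketched in a few lines of plain Python. This is a single-process imitation under stated assumptions, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) are made up for this example, and each "chunk" stands in for the slice of data a separate node would process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: turn each line of one chunk into (word, 1) pairs."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def map_reduce(chunks):
    # On a cluster, each map_phase call would run on a different node.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


chunks = [["the quick brown fox", "the lazy dog"], ["The fox"]]
counts = map_reduce(chunks)
```

Because the map step looks at one chunk at a time and the reduce step looks at one key at a time, both can be spread over as many machines as there are chunks and keys.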
At its core, Hadoop is an open source MapReduce
implementation. Funded by Yahoo, it emerged in 2006 and, according to its
creator Doug Cutting, reached "web scale" capability in early
2008.
As the Hadoop project matured, it acquired further components to enhance
its usability and functionality. The name "Hadoop" has
come to represent this entire ecosystem. There are parallels
with the emergence of Linux: The name refers strictly to the Linux
kernel, but it has gained acceptance as referring to a complete
operating system.
Hadoop's lower levels: HDFS and MapReduce
Above, we discussed the ability of MapReduce to distribute
computation over multiple servers. For that computation to take
place, each server must have access to the data. This is the role of
HDFS, the Hadoop Distributed File System.
HDFS and MapReduce are robust. Servers in a Hadoop cluster can
fail and not abort the computation process. HDFS ensures data is
replicated with redundancy across the cluster. On completion of a
calculation, a node will write its results back into HDFS.
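To see why replication lets a cluster shrug off server failures, consider the toy model below. It is only a sketch: the round-robin placement, the node names, and the helper functions `place_blocks` and `readable_blocks` are assumptions for illustration, and HDFS's real placement policy is more sophisticated.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    }


nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(8, nodes, replication=3)
after_two_failures = readable_blocks(placement, failed={"n1", "n2"})
after_three_failures = readable_blocks(placement, failed={"n1", "n2", "n3"})
```

With three replicas, any two of the four nodes can fail and every block is still readable; only when a third node goes down do some blocks become unavailable.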
There are no restrictions on the data that HDFS stores. Data may
be unstructured and schemaless. By contrast, relational databases
require that data be structured and schemas be defined before storing
the data. With HDFS, making sense of the data is the responsibility
of the developer's code.
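Here is a small sketch of what "making sense of the data is the responsibility of the developer's code" looks like in practice. The log format, the field names, and the `parse` helper are hypothetical. The point is that structure is applied when the data is read, and lines that don't fit are simply skipped rather than rejected at load time.

```python
from collections import Counter

# Raw lines as they might sit in a file: no schema was declared up front.
RAW_LINES = [
    "2012-02-01T10:00:00 INFO user=alice action=login",
    "2012-02-01T10:00:05 ERROR user=bob action=upload",
    "not a well-formed line",
    "2012-02-01T10:01:00 INFO user=alice action=logout",
]

LEVELS = {"INFO", "WARN", "ERROR"}


def parse(line):
    """Impose structure at read time; return None for lines that don't fit."""
    parts = line.split()
    if len(parts) < 2 or parts[1] not in LEVELS:
        return None
    fields = dict(p.split("=", 1) for p in parts[2:] if "=" in p)
    return {"timestamp": parts[0], "level": parts[1], **fields}


records = [r for r in map(parse, RAW_LINES) if r is not None]
levels = Counter(r["level"] for r in records)
```

A relational database would have refused the malformed line when it was inserted; here it is stored anyway, and the decision about what to do with it is deferred to whoever reads it.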
Programming Hadoop at the MapReduce level is a case of working with the
Java APIs, and manually loading data files into HDFS.
Improving programmability: Pig and Hive
Working directly with Java APIs can be tedious and error prone.
It also restricts usage of Hadoop to Java programmers. Hadoop offers
two solutions for making Hadoop programming easier.
Pig is a programming
language that simplifies the common tasks of working with Hadoop:
loading data, expressing transformations on the data, and storing
the final results. Pig's built-in operations can make sense of
semi-structured data, such as log files, and the language is
extensible using Java to add support for custom data types and
transformations.
Hive enables Hadoop
to operate as a data warehouse. It superimposes structure on data in HDFS
and then permits queries over the data using a familiar SQL-like
syntax. As with Pig, Hive's core capabilities are
extensible.
Choosing between Hive and Pig can be confusing. Hive
is more suitable for data warehousing tasks, with predominantly
static structure and the need for frequent analysis. Hive's closeness
to SQL makes it an ideal point of integration between Hadoop and
other business intelligence tools.
Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming
data flows for incorporation into larger applications. Pig is a
thinner layer over Hadoop than Hive, and its main advantage is to
drastically cut the amount of code needed compared to direct
use of Hadoop's Java APIs. As such, Pig's intended audience remains
primarily the software developer.
Improving data access: HBase, Sqoop and Flume
At its heart, Hadoop is a batch-oriented system. Data are loaded
into HDFS, processed, and then retrieved. This is somewhat of a
computing throwback, and often, interactive and random access to data
is required.
Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google's
BigTable, the project aims to host billions of rows of data for rapid access.
MapReduce
can use HBase as both a source and a destination for its
computations, and Hive and Pig can be used in combination with
HBase.
In order to grant random access to the data, HBase does impose a
few restrictions: Hive performance with HBase is 4-5 times slower than with plain
HDFS, and the maximum amount of data you can store in HBase is approximately
a petabyte, versus HDFS' limit of over 30PB.
HBase is ill-suited to ad-hoc analytics and more appropriate for
integrating big data as part of a larger application. Use cases
include logging, counting and storing time-series data.
The Hadoop Bestiary
Ambari: Deployment, configuration and monitoring
Flume: Collection and import of log and event data
HBase: Column-oriented database scaling to billions of rows
HCatalog: Schema and data type sharing over Pig, Hive and MapReduce
HDFS: Distributed redundant file system for Hadoop
Hive: Data warehouse with SQL-like access
Mahout: Library of machine learning and data mining algorithms
MapReduce: Parallel computation on server clusters
Pig: High-level programming language for Hadoop computations
Oozie: Orchestration and workflow management
Sqoop: Imports data from relational databases
Whirr: Cloud-agnostic deployment of clusters
Zookeeper: Configuration management and coordination
Getting data in and out
Improved interoperability with the rest of the data world is
provided by Sqoop and Flume. Sqoop is a tool designed to import data from
relational databases into Hadoop, either directly into HDFS or into
Hive. Flume is designed to import streaming flows of log data
directly into HDFS.
Hive's SQL friendliness means that it can be used as a point of
integration with the vast universe of database tools capable of making
connections via JDBC or ODBC database drivers.
Coordination and workflow: Zookeeper and Oozie
With a growing family of services running as part of a Hadoop
cluster, there's a need for coordination and naming services. As
computing nodes can come and go, members of the cluster need
to synchronize with each other, know where to access services, and
know how they should be configured. This is the purpose of
Zookeeper.
Production systems that use Hadoop often contain complex
pipelines of transformations, each of which depends on the output of
others. For example, the arrival of a new batch of data will trigger
an import, which must then trigger recalculations in dependent
datasets. The Oozie
component provides features to manage the workflow and dependencies,
removing the need for developers to code custom solutions.
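The dependency problem described here is essentially a topological sort. The sketch below uses Python's standard `graphlib` module to work out a valid execution order. The step names are hypothetical, and this is not Oozie's own workflow format, only an illustration of the kind of bookkeeping a workflow manager takes off your hands.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can run.
workflow = {
    "import_batch": set(),
    "clean_data": {"import_batch"},
    "daily_totals": {"clean_data"},
    "user_stats": {"clean_data"},
    "report": {"daily_totals", "user_stats"},
}


def execution_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(workflow)
```

Steps with no dependency between them, such as `daily_totals` and `user_stats` in this example, could run in parallel. A circular dependency raises `graphlib.CycleError` instead of looping forever.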
Management and deployment: Ambari and Whirr
One of the commonly added features incorporated into Hadoop by
distributors such as IBM and Microsoft is monitoring and
administration. Though in an early stage, Ambari aims
to add these features to the core Hadoop project. Ambari is intended to help system
administrators deploy and configure Hadoop, upgrade clusters, and
monitor services. Through an API, it may be integrated with other
system management tools.
Though not strictly part of Hadoop, Whirr is a highly complementary
component. It offers a way of running services, including Hadoop, on
cloud platforms. Whirr is cloud neutral and
currently supports the Amazon EC2 and Rackspace services.
Machine learning: […]
Data
bigdata
datatool
hadoop
planningforbigdata
from google
february 2012 by rahuldave
The Daily Dot wants to tell the web's story with social data journalism
august 2011 by rahuldave
If the Internet is the public square of the 21st century, the Daily Dot wants to be its town crier. The newly launched online media startup is trying an experiment in community journalism, where the community is the web. It's an interesting vision, and one that looks to capitalize on the amount of time people are spending online.
The Daily Dot wants to tell stories through a mix of data journalism and old-fashioned reporting, where its journalists pick up the phone and chase down the who, what, when, where, how and why of a video, image or story that's burning up the social web. The site's beat writers, who are members of the communities they cover, watch what's happening on Twitter, Facebook, Reddit, YouTube, Tumblr and Etsy, and then cover the issues and people that matter to them.
Even if the newspaper metaphor has some flaws, this focus on original reporting could help distinguish the Daily Dot in a media landscape where attention and quality are both fleeting. In the hurly burly of the tech and new media blogosphere, picking up the phone to chase down a story is too often neglected.
There's something significant about that approach. Former VentureBeat editor Owen Thomas (@OwenThomas), the founding editor of the Daily Dot, has emphasized this angle in interviews with AdWeek and Forbes. Instead of mocking what people do online, as many mainstream media outlets have been doing for decades, the Daily Dot will tell their stories in the same way that a local newspaper might cover a country fair or concert. While Thomas was a well-known master of snark and satire during his tenure at Valleywag, in this context he's changed his style.
Where's the social data?
Whether or not this approach gains traction within the communities the Daily Dot covers remains to be seen. The Daily Dot was co-founded by Nova Spivack, former newspaper executive Nicholas White, and PR consultant Josh Jones-Dilworth, with a reported investment of some $600,000 from friends and family. White has written that he gave up the newspaper to save newspapering. Simply put, the Daily Dot is experimenting with covering the Internet in a way that most newspapers have failed to do.
"I trust that if we keep following people into the places where they gather to trade gossip, argue the issues, seek inspiration, and share lives, then we will also find communities in need of quality journalism," wrote White. "We will be carrying the tradition of local community-based journalism into the digital world, a professional coverage, practice and ethics coupled with the kind of local interaction and engagement required of a relevant and meaningful news source. Yet local to us means the digital communities that are today every bit as vibrant as those geographically defined localities."
To do that, they'll be tapping into an area that Spivack, a long-time technology entrepreneur, has been investing in and writing about for years: data. Specifically, they're applying data journalism to mine and analyze the social data from two of the web's most vibrant platforms: Tumblr and Reddit.
White himself is unequivocal about the necessity of data journalism in the new digital landscape, whether at the Daily Dot or beyond:
The Daily Dot may be going in this direction now because of our unique coverage area, but if this industry is to flourish in the 21st century, programming journalists should not remain unique. Data, just like the views of experts, men on the street, polls and participants, is a perspective on the world. And in the age of ATMs, automatic doors and customer loyalty cards, it's become just as ubiquitous. But the media isn't so good with data, with actual mathematics. Our stock-in-trade is the anecdote. Despite a complete lack of solid evidence, we've been telling people their cell phones will give them cancer. Our society ping-pongs between eating and not eating carbs, drinking too much coffee and not enough water, getting more Omega-3s — all on the basis of epidemiological research that is far, far, far from definitive. Most reporters do not know how to evaluate research studies, and so they report the authors' conclusions without any critical evaluation — and studies need critical evaluation.
Marshall Kirkpatrick, a proponent and practitioner of data journalism, dug deep into how data journalism happens at the Daily Dot. While he's similarly unsure of whether the publication will be interesting to a large enough audience to sustain an advertising venture, the way that the Daily Dot is going about hunting down digital stories is notable. Kirkpatrick shared the details over at ReadWriteWeb:
In order to capture and analyze that data from sites like Twitter, YouTube, Reddit, Etsy and more (the team says it's indexing a new community about every six weeks), the Dot has partnered with the mathematicians at Ravel Data. Ravel uses 80Legs for unblockable crawling, then Hadoop, its own open source framework called GoldenOrb and then an Eigenvector centrality algorithm (similar to Pagerank) to index, analyze, rank and discover connections between millions of users across these social networks.
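The ranking step Kirkpatrick mentions can be illustrated with a small sketch of eigenvector centrality computed by power iteration. This is not Ravel's or GoldenOrb's code, which the article does not show. The function name, the toy graph, and the choice to iterate on the adjacency matrix plus the identity (which keeps the same ranking but avoids oscillation on some graphs) are all assumptions made for illustration.

```python
def eigenvector_centrality(edges, iterations=100, tol=1e-9):
    """Score nodes of an undirected graph by power iteration.

    `edges` is an iterable of (a, b) pairs. Hypothetical helper,
    not part of any library named in the article.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    score = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        # New score = own score + sum of neighbors' scores, i.e. one
        # multiplication by (A + I). The shift leaves eigenvectors unchanged.
        new = {n: score[n] + sum(score[m] for m in neighbors[n])
               for n in neighbors}
        norm = max(new.values())
        new = {n: v / norm for n, v in new.items()}  # rescale so the top score is 1
        if max(abs(new[n] - score[n]) for n in new) < tol:
            return new
        score = new
    return score


# Toy "community": who interacts with whom (made-up data).
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
scores = eigenvector_centrality(edges)
leaderboard = sorted(scores, key=scores.get, reverse=True)
print(leaderboard)
```

A member scores highly here by being connected to other highly scored members, which is the intuition shared with PageRank. A production system working across millions of users would run this as a distributed job rather than in a single process.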
There are a couple of aspects of data journalism to consider here. One is supplementing the traditional "nose for news" that Daily Dot writers apply to finding stories. "The data really begins to serve as our editorial prosthetics of sorts, telling us where to look, with whom to speak, and giving us the basic groundwork of the communities that we can continue to prod in interesting ways and ask questions of," explained Doug Freeman, an associate at Daily Dot investor Josh Jones-Dilworth's PR firm, in an interview. In other words, the editors of the Daily Dot analyze social data to identify the community's best sources for stories and share them on a "Leaderboard" that — in beta — shows a ranked list of members of Tumblr and Reddit.
Another open question is how social data could help with the startup's revenue down the road. "Our data business is a way of creating and funding new value in this regard; we instigated structured crawls of all of the communities we will cover and will continue to do so as we expand into new places," said Freeman. "We started with Reddit (for data and editorial both) because it is small and has a lot of complex properties — a good test balloon. We've now completed data work with Tumblr and YouTube and are continuing." For each community, data provides a view of members, behaviors, and influence dynamics.
That data also relates to how the Daily Dot approaches marketing, branding and advertising. "It's essentially a to-do list of people we need to get reading the Dot, and a list of their behaviors," said Freeman. "From a brand [point of view], it's market and audience intelligence that we can leverage, with services alongside it. From an advertiser [point of view], this data gives resolution and insight that few other outlets can provide. It will get even more exciting over time as we start to tie Leaderboard data to user accounts and instigate CPA-based campaigns with bonuses and bounties for highly influential clicks."
Taken as a whole, what the Daily Dot is doing with social data and digital journalism feels new, or at least like a new evolution. We've seen Facebook and Twitter integration into major media sites, but not Reddit and Tumblr. It could be that the communities of these sites, acting as "curation layers" for the web, will produce excellent results in terms of popular content, though relevance could still be an issue. Whether this venture in data journalism succeeds will depend on its retaining the interest and loyalty of the communities it covers. What is clear, for now, is that the experiment will be fun to watch, cute LOL cats and all.
Related:
Data journalism, data tools, and the newsroom stack
The growing importance of data journalism
Before you interrogate data, you must tame it
Social data is an oracle waiting for a question
Data
Publishing
databusiness
datajournalism
dataproduct
datatools
media
startup
from google
The Daily Dot wants to tell stories through a mix of data journalism and old-fashioned reporting, where its journalists pick up the phone and chase down the who, what, when, where, how and why of a video, image or story that's burning up the social web. The site's beat writers, who are members of the communities they cover, watch what's happening on Twitter, Facebook, Reddit, YouTube, Tumblr and Etsy, and then cover the issues and people that matter to them.
Even if the newspaper metaphor has some flaws, this focus on original reporting could help distinguish the Daily Dot in a media landscape where attention and quality are both fleeting. In the hurly burly of the tech and new media blogosphere, picking up the phone to chase down a story is too often neglected.
There's something significant about that approach. Former VentureBeat editor Owen Thomas (@OwenThomas), the founding editor of the Daily Dot, has emphasized this angle in interviews with AdWeek and Forbes. Instead of mocking what people do online, as many mainstream media outlets have been doing for decades, the Daily Dot will tell their stories in the same way that a local newspaper might cover a country fair or concert. While Thomas was a well-known master of snark and satire during his tenure at Valleywag, in this context he's changed his style.
Where's the social data?
Whether or not this approach gains traction within the communities the Daily Dot covers remains to be seen. The Daily Dot was co-founded by Nova Spivack, former newspaper executive Nicholas White, and PR consultant Josh Jones-Dilworth, with a reported investment of some $600,000 from friends and family. White has written that he gave up the newspaper to save newspapering. Simply put, the Daily Dot is experimenting with covering the Internet in a way that most newspapers have failed to do.
"I trust that if we keep following people into the places where they gather to trade gossip, argue the issues, seek inspiration, and share lives, then we will also find communities in need of quality journalism," wrote White. "We will be carrying the tradition of local community-based journalism into the digital world, a professional coverage, practice and ethics coupled with the kind of local interaction and engagement required of a relevant and meaningful news source. Yet local to us means the digital communities that are today every bit as vibrant as those geographically defined localities."
august 2011 by rahuldave
An iTunes model for data
april 2011 by rahuldave
As we move toward a data economy, can we take the digital content model and apply it to data acquisition and sales? That's a suggestion that Gil Elbaz (@gilelbaz), CEO and co-founder of the data platform Factual, made in passing at his recent talk at Web 2.0 Expo.
Elbaz spoke about some of the hurdles that startups face with big data — not just the question of storage, but the question of access. But as he addressed the emerging data economy, Elbaz said we will likely see novel access methods and new marketplaces for data. Startups will be able to build value-added services on top of big data, rather than having to worry about gathering and storing the data themselves. "An iTunes for data," is how he described it.
So what would it mean to apply the iTunes model to data sales and distribution? I asked Elbaz to expand on his thoughts.
What problems does an iTunes model for data solve?
Gil Elbaz: One key framework that will catalyze data sharing, licensing and consumption will be an open data marketplace. It is a place where data can be programmatically searched, licensed, accessed, and integrated directly into a consumer application. One might call it the "eBay of data" or the "iTunes of data." iTunes might be the better metaphor because it's not just the content that is valuable, but also the convenience of the distribution channel and the ability to pay for only what you will consume.
How would an iTunes model for data address licensing and ownership?
Gil Elbaz: In the case of iTunes, in a single click I purchase a track, download it, establish licensing rights on my iPhone and up to four other authorized devices, and it's immediately integrated into my daily life. Similarly, the deepest value will come for a marketplace that, with a single click, allows a developer to license data and have it automatically integrated into their particular application development stack. That might mean having the data instantly accessible via API, automatically replicated to a MySQL server on EC2, synchronized at Database.com, or copied to Google App Engine.
An iTunes for data could be priced from a single record/entity to a complete dataset. And it could be licensed for single use, caching allowed for 24 hours, or perpetual rights for a specific application.
What needs to happen for us to move away from "buying the whole album" to buying the data equivalent of a single?
Gil Elbaz: The marketplace will eventually facilitate competitive bidding, which will bring the price down for developers. iTunes is based on a fairly simple set-pricing model. But, in a world of multiple data vendors with commodity data, only truly unique data will command a premium price. And, of course, we'll need great search technology to find the right data or data API based on the developer's codified requirements: specified data schema, data quality bar, licensing needs, and the bid price.
Another dimension that is relevant to Factual's current model: data as a currency. Some of our most interesting partnerships are based on an open exchange of information. Partners access our data and also contribute back streams of edits and other bulk data into our ecosystem. We highly value the contributions our partners make. "Currency" is a medium of exchange and a basis for accessing other scarce resources. In a world where not everyone is yet actively looking to license data, unique data is increasingly an important medium of exchange.
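To make the idea of searching a marketplace by "codified requirements" concrete, here is a minimal sketch. It assumes a marketplace exposes offers with a schema, a quality score, a set of license types, and a price. None of these class names, field names, or license labels come from Factual or any real marketplace API; they are hypothetical stand-ins for the schema, quality bar, licensing needs, and bid price that Elbaz lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """A hypothetical dataset listing in a data marketplace."""
    vendor: str
    fields: frozenset      # schema: the attributes each record carries
    quality: float         # assumed quality score between 0 and 1
    licenses: frozenset    # e.g. "single_use", "cache_24h", "perpetual"
    price: float           # asking price per dataset, arbitrary units


@dataclass(frozen=True)
class Requirements:
    """A developer's codified requirements, as described in the interview."""
    fields: frozenset
    min_quality: float
    license: str
    max_bid: float


def matching_offers(offers, req):
    """Return offers that satisfy every requirement, cheapest first."""
    hits = [
        o for o in offers
        if req.fields <= o.fields          # offer covers the required schema
        and o.quality >= req.min_quality   # meets the quality bar
        and req.license in o.licenses      # grants the needed license
        and o.price <= req.max_bid         # fits within the bid
    ]
    return sorted(hits, key=lambda o: o.price)


# Made-up listings for illustration only.
offers = [
    Offer("vendor_a", frozenset({"name", "lat", "lon"}), 0.90,
          frozenset({"single_use", "perpetual"}), 50.0),
    Offer("vendor_b", frozenset({"name", "lat", "lon", "phone"}), 0.95,
          frozenset({"cache_24h", "perpetual"}), 30.0),
    Offer("vendor_c", frozenset({"name"}), 0.99,
          frozenset({"perpetual"}), 5.0),
]
req = Requirements(frozenset({"name", "lat", "lon"}), 0.8, "perpetual", 40.0)
matches = matching_offers(offers, req)
print([o.vendor for o in matches])
```

Sorting by price is only a stand-in for the competitive bidding Elbaz anticipates. A real marketplace would presumably let vendors respond to a bid rather than simply filtering fixed prices.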
This interview was edited and condensed.
Related:
Video: Hjalmar Gislason on data discovery and search
Data markets aren't coming. They're already here
The black market for data
Data
datamarket
digitalcontent
itunes
from google
Good data cuts through the chaos in Haiti
april 2010 by rahuldave
As computer scientists and technologists, we're used to dealing with large numbers in the abstract. But expressed in human terms, the mind-boggling numbers of the Haiti earthquake -- 250,000 dead, 300,000 injured and more than 1 million people left homeless -- are hard to comprehend.
The recovery from a disaster of this magnitude presents some important tasks for information technology: coordination of effort, triaging those most in need, and getting good data into the hands of decision makers and aid workers.
Here's a partial list of aid, relief, and rescue organizations currently in Haiti, gleaned from Wikipedia:
An Argentine military field hospital.
The Red Cross/Crescent, in various forms.
The U.S. military.
Multiple U.N. agencies.
Remnants of the Haitian government.
The French navy.
Sri Lankan relief workers.
At least 2,000 rescuers from 43 different groups (along with 161 search dogs).
A wealth of collaborators like this presents unique challenges around information fusion. Unlike business competitors or opposing sides of a war, the different groups want to share as much information as possible to achieve their common goal.
Each organization has produced a fairly detailed picture of the parts of Haiti they are interacting with. Each organization also wants to consume every other organization's detailed knowledge of the situation. To act effectively, they need to integrate that knowledge into a common operating picture that accurately models the situation on the ground yesterday, today, and tomorrow.
Better coordination through data
Our reaction to the earthquake was to try to help in the best way we knew how. We set up a publicly available instance of our Palantir Government product, already loaded with relevant data, for use by aid workers and organizations working in Haiti. Using relevant, open-source data, we started modeling a picture of what's going on in Haiti.
Our first cut was to include the locations and names of collapsed buildings, internally displaced people (IDP) camps, and Mission 4636 SMS messages, among others. We also added map layers that let us see which administrative zone any point on the map was located in.
Having mapped the data into this model, users have access to it through a suite of visualization, analysis, querying, and collaboration tools that allow them to get useful answers to practical questions. Here are some examples:
Which administrative sectors have had the most SMS requests for food in the past 24 hours?
What collapsed buildings are suspected to contain hazardous materials?
Are any IDP camps close enough to hazmat sites to warrant special precautions? Should residents be moved?
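Two of the questions above can be sketched in a few lines once the data sits in a common model. The sketch below is not how Palantir Government works internally. The record layouts (`sector`, `need`, `time`, `lat`, `lon`), the function names, and the one-kilometer radius are assumptions chosen for illustration, and the sample records are made up.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def food_requests_by_sector(messages, now, window=timedelta(hours=24)):
    """Count SMS food requests per sector within the window, busiest first."""
    cutoff = now - window
    counts = Counter(
        m["sector"] for m in messages
        if m["need"] == "food" and cutoff <= m["time"] <= now
    )
    return counts.most_common()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def camps_near_hazmat(camps, hazmat_sites, radius_km=1.0):
    """Return names of camps within radius_km of any suspected hazmat site."""
    return [
        c["name"] for c in camps
        if any(haversine_km(c["lat"], c["lon"], h["lat"], h["lon"]) <= radius_km
               for h in hazmat_sites)
    ]


# Made-up sample records.
now = datetime(2010, 2, 1, 12, 0)
messages = [
    {"sector": "Delmas", "need": "food", "time": now - timedelta(hours=2)},
    {"sector": "Delmas", "need": "food", "time": now - timedelta(hours=5)},
    {"sector": "Carrefour", "need": "food", "time": now - timedelta(hours=1)},
    {"sector": "Delmas", "need": "water", "time": now - timedelta(hours=1)},
    {"sector": "Carrefour", "need": "food", "time": now - timedelta(hours=30)},
]
camps = [
    {"name": "camp_a", "lat": 18.540, "lon": -72.340},
    {"name": "camp_b", "lat": 18.600, "lon": -72.340},
]
hazmat_sites = [{"lat": 18.541, "lon": -72.340}]

print(food_requests_by_sector(messages, now))
print(camps_near_hazmat(camps, hazmat_sites))
```

The brute-force distance check compares every camp with every site, which is fine for a few thousand camps. A larger dataset would call for a spatial index.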
Next: Stay ahead of Haiti's rainy season
With the infrastructure of the country destroyed, Haiti's rain and hurricane season will be more dangerous than usual. Not only are the normal structures that protect citizens from the waters gone, but people have moved out of the ruins of Port-au-Prince to hastily constructed IDP camps, some of which are sitting in the flood plains of Haiti's waterways.
The essential question facing relief workers is: Which of the approximately 2,500 IDP camps are most at risk from flooding?
In a place like the United States, an earthquake response and recovery team could engage the services and expertise of the U.S. Geological Survey (USGS), which maintains the National Water Information System. No such luck in Haiti, where the closest thing to the USGS is the Centre National de l'Information Géo-Spatiale. A quick look at the organization's website shows it didn't really make it through the earthquake.
We decided to help out. Since we were starting from square one, we put together data from the Army Geospatial Center, the U.N., NOAA, Haiti-based NGOs, a number of academic papers, and even geo-tagged photos from Flickr. The time it took to integrate this data? About six hours. Time it took to do the analysis? About seven minutes. Amount of that work that is reusable? All of it.
The best way to improve this analysis is to add detailed information about flooding, gathered from the field. We're looking to get new conduits of information into the Haiti instance as the rains really pick up.
If you'd like to help us, we're accepting new data sources, analyses, and contact with relief organizations.
Volunteers, supplies, and goodwill are only the raw ingredients to recovery. It's the efficient and timely application of those resources to Haiti's most pressing problems that will make recovery a reality.
Related:
How crowdsourcing helped Haiti's relief efforts
analysis
data
emergency
haiti
from google