The NoSQL movement
february 2012 by rahuldave
In a conversation last year, Justin Sheehy, CTO of Basho, described NoSQL as a movement,
rather than a technology. This description immediately felt right;
I've never been comfortable talking about NoSQL, which when taken
literally, extends from the minimalist Berkeley DB (commercialized
as Sleepycat,
now owned by Oracle) to the big iron HBase, with detours into software as
fundamentally different as Neo4J (a
graph database) and FluidDB
(which defies description).
But what does it mean to say that NoSQL is a movement rather
than a technology? We certainly don't see picketers outside
Oracle's headquarters. Justin said succinctly that NoSQL is a
movement for choice in database architecture. There is no single
overarching technical theme; a single technology would belie the
principles of the movement.
Think of the last 15 years of software development. We've gotten
very good at building large, database-backed applications. Many of
them are web applications, but even more of them aren't. "Software
architect" is a valid job description; it's a position to which
many aspire. But what do software architects do? They specify the
high-level design of applications: the front end, the APIs, the
middleware, the business logic — the back end? Well, maybe not.
Since the '80s, the dominant back end of business systems has been a
relational database, whether Oracle, SQL Server or DB2. That's not
much of an architectural choice. Those are all great products, but
they're essentially similar, as are all the other relational
databases. And it's remarkable that we've explored many architectural
variations in the design of clients, front ends, and middleware, on a
multitude of platforms and frameworks, but haven't until recently
questioned the architecture of the back end. Relational databases have
been a given.
Many things have changed since the advent of relational
databases:
We're dealing with much more data. Although advances in storage
capacity and CPU speed have allowed the databases to keep pace,
we're in a new era where size itself is an important part of the
problem, and any significant database needs to be distributed.
We require sub-second responses to queries. In the '80s, most
database queries could run overnight as batch jobs. That's no
longer acceptable. While some analytic functions can still run as
overnight batch jobs, we've seen the web evolve from static files
to complex database-backed sites, and that requires sub-second
response times for most queries.
We want applications to be up 24/7. Setting up redundant
servers for static HTML files is easy, but a database replication
in a complex database-backed application is another.
We're seeing many applications in which the database has to
soak up data as fast (or even much faster) than it processes
queries: in a logging application, or a distributed sensor
application, writes can be much more frequent than reads.
Batch-oriented ETL (extract, transform, and load) hasn't
disappeared, and won't, but capturing high-speed data flows is
increasingly important.
We're frequently dealing with changing data or with
unstructured data. The data we collect, and how we use it, grows
over time in unpredictable ways. Unstructured data isn't a
particularly new feature of the data landscape, since unstructured
data has always existed, but we're increasingly unwilling to force
a structure on data a priori.
We're willing to sacrifice our sacred cows. We know that
consistency and isolation and other properties are very valuable,
of course. But so are some other things, like latency and
availability and not losing data even if our primary server goes
down. The challenges of modern applications make us realize that
sometimes we might need to weaken one of these constraints in order
to achieve another.
These changing requirements lead us to different tradeoffs and
compromises when designing software. They require us to rethink
what we require of a database, and to come up with answers aside
from the relational databases that have served us well over the
years. So let's look at these requirements in somewhat more
detail.
Microsoft SQL Server is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at www.microsoft.com/sql.
Size, response, availability
It's a given that any modern application is going to be
distributed. The size of modern datasets is only one reason for
distribution, and not the most important. Modern applications
(particularly web applications) have many concurrent users who
demand reasonably snappy response. In their 2009 Velocity Conference talk, Performance Related
Changes and their User Impact, Eric Schurman and Jake Brutlag
showed results from independent research projects at Google and
Microsoft. Both projects demonstrated imperceptibly small increases
in response time cause users to move to another site; if response
time is over a second, you're losing a very measurable percentage
of your traffic.
If you're not building a web application — say you're doing
business analytics, with complex, time-consuming queries — the
world has changed, and users now expect business analytics to run
in something like real time. Maybe not the sub-second latency
required for web users, but queries that run overnight are no
longer acceptable. Queries that run while you go out for coffee are
marginal. It's not just a matter of convenience; the ability to run
dozens or hundreds of queries per day changes the nature of the
work you do. You can be more experimental: you can follow through
on hunches and hints based on earlier queries. That kind of
spontaneity was impossible when research went through the DBA at
the data warehouse.
Whether you're building a customer-facing application or doing
internal analytics, scalability is a big issue. Vertical
scalability (buy a bigger, faster machine) always runs into limits.
Now that the laws of physics have stalled Intel-architecture clock
speeds in the 3.5GHz range, those limits are more apparent than
ever. Horizontal scalability (build a distributed system with more
nodes) is the only way to scale indefinitely. You're scaling
horizontally even if you're only buying single boxes: it's been a
long time since I've seen a server (or even a high-end desktop)
that doesn't sport at least four cores. Horizontal scalability is
tougher when you're scaling across racks of servers at a colocation
facility, but don't be deceived: that's how scalability works in
the 21st century, even on your laptop. Even in your cell phone. We
need database technologies that aren't just fast on single servers:
they must also scale across multiple servers.
Modern applications also need to be highly available. That goes
without saying, but think about how the meaning of "availability"
has changed over the years. Not much more than a decade ago, a web
application would have a single HTTP server that handed out static
files. These applications might be data-driven; but "data driven"
meant that a batch job rebuilt the web site overnight, and user
transactions were queued into a batch processing system, again for
processing overnight. Keeping such a system running isn't terribly
difficult. High availability doesn't impact the database. If the
database is only engaged in batched rebuilds or transaction
processing, the database can crash without damage. That's the world
for which relational databases were designed. In the '80s, if your
mainframe ran out of steam, you got a bigger one. If it crashed,
you were down. But when databases became a living, breathing part
of the application, availability became an issue. There is no way
to make a single system highly available; as soon as any component
fails, you're toast. Highly available systems are, by nature,
distributed systems.
If a distributed database is a given, the next question is how
much work a distributed system will require. There are
fundamentally two options: databases that have to be distributed
manually, via sharding; and databases that are inherently
distributed. Relational databases are split between multiple hosts
by manual sharding, or determining how to partition the datasets
based on some properties of the data itself: for example, first
names starting with A-K on one server, L-Z on another. A lot of
thought goes into designing a sharding and replication strategy
that doesn't impair performance, while keeping the data relatively
balanced between servers. There's a third option that is
essentially a hybrid: databases that are not inherently
distributed, but that are designed so they can be partitioned
easily. MongoDB is an example
of a database that can be sharded easily (or even automatically);
HBase, Riak, and Cassandra are all inherently
distributed, with options to control how replication and
distribution work.
What database choices are viable when you need good interactive
response? There are two separate issues: read latency and write
latency. For reasonably simple queries on a database with
well-designed indexes, almost any modern database can give decent
read latency, even at reasonably large scale. Similarly, just about
all modern databases claim to be able to keep up with writes at
high-speed. Most of these databases, including HBase, Cassandra,
Riak, and CouchDB, write
data immediately to an append-only file, which is an extremely
efficient operation. As a result, writes are often significantly
faster than reads.
Whether any particular database can deliver the performance you
need depends on the nature of the application, and whether you've
designed the application in a way that uses the database
efficiently: in particular, the structure of queries, more than the
structure[…]
Data
databases
nonrelationaldatabase
nosql
planningforbigdata
from google
rather than a technology. This description immediately felt right;
I've never been comfortable talking about NoSQL, which when taken
literally, extends from the minimalist Berkeley DB (commercialized
as Sleepycat,
now owned by Oracle) to the big iron HBase, with detours into software as
fundamentally different as Neo4J (a
graph database) and FluidDB
(which defies description).
But what does it mean to say that NoSQL is a movement rather
than a technology? We certainly don't see picketers outside
Oracle's headquarters. Justin said succinctly that NoSQL is a
movement for choice in database architecture. There is no single
overarching technical theme; a single technology would belie the
principles of the movement.
Think of the last 15 years of software development. We've gotten
very good at building large, database-backed applications. Many of
them are web applications, but even more of them aren't. "Software
architect" is a valid job description; it's a position to which
many aspire. But what do software architects do? They specify the
high-level design of applications: the front end, the APIs, the
middleware, the business logic — the back end? Well, maybe not.
Since the '80s, the dominant back end of business systems has been a
relational database, whether Oracle, SQL Server or DB2. That's not
much of an architectural choice. Those are all great products, but
they're essentially similar, as are all the other relational
databases. And it's remarkable that we've explored many architectural
variations in the design of clients, front ends, and middleware, on a
multitude of platforms and frameworks, but haven't until recently
questioned the architecture of the back end. Relational databases have
been a given.
Many things have changed since the advent of relational
databases:
We're dealing with much more data. Although advances in storage
capacity and CPU speed have allowed the databases to keep pace,
we're in a new era where size itself is an important part of the
problem, and any significant database needs to be distributed.
We require sub-second responses to queries. In the '80s, most
database queries could run overnight as batch jobs. That's no
longer acceptable. While some analytic functions can still run as
overnight batch jobs, we've seen the web evolve from static files
to complex database-backed sites, and that requires sub-second
response times for most queries.
We want applications to be up 24/7. Setting up redundant
servers for static HTML files is easy, but a database replication
in a complex database-backed application is another.
We're seeing many applications in which the database has to
soak up data as fast (or even much faster) than it processes
queries: in a logging application, or a distributed sensor
application, writes can be much more frequent than reads.
Batch-oriented ETL (extract, transform, and load) hasn't
disappeared, and won't, but capturing high-speed data flows is
increasingly important.
We're frequently dealing with changing data or with
unstructured data. The data we collect, and how we use it, grows
over time in unpredictable ways. Unstructured data isn't a
particularly new feature of the data landscape, since unstructured
data has always existed, but we're increasingly unwilling to force
a structure on data a priori.
We're willing to sacrifice our sacred cows. We know that
consistency and isolation and other properties are very valuable,
of course. But so are some other things, like latency and
availability and not losing data even if our primary server goes
down. The challenges of modern applications make us realize that
sometimes we might need to weaken one of these constraints in order
to achieve another.
These changing requirements lead us to different tradeoffs and
compromises when designing software. They require us to rethink
what we require of a database, and to come up with answers aside
from the relational databases that have served us well over the
years. So let's look at these requirements in somewhat more
detail.
Microsoft SQL Server is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at www.microsoft.com/sql.
Size, response, availability
It's a given that any modern application is going to be
distributed. The size of modern datasets is only one reason for
distribution, and not the most important. Modern applications
(particularly web applications) have many concurrent users who
demand reasonably snappy response. In their 2009 Velocity Conference talk, Performance Related
Changes and their User Impact, Eric Schurman and Jake Brutlag
showed results from independent research projects at Google and
Microsoft. Both projects demonstrated imperceptibly small increases
in response time cause users to move to another site; if response
time is over a second, you're losing a very measurable percentage
of your traffic.
If you're not building a web application — say you're doing
business analytics, with complex, time-consuming queries — the
world has changed, and users now expect business analytics to run
in something like real time. Maybe not the sub-second latency
required for web users, but queries that run overnight are no
longer acceptable. Queries that run while you go out for coffee are
marginal. It's not just a matter of convenience; the ability to run
dozens or hundreds of queries per day changes the nature of the
work you do. You can be more experimental: you can follow through
on hunches and hints based on earlier queries. That kind of
spontaneity was impossible when research went through the DBA at
the data warehouse.
Whether you're building a customer-facing application or doing
internal analytics, scalability is a big issue. Vertical
scalability (buy a bigger, faster machine) always runs into limits.
Now that the laws of physics have stalled Intel-architecture clock
speeds in the 3.5GHz range, those limits are more apparent than
ever. Horizontal scalability (build a distributed system with more
nodes) is the only way to scale indefinitely. You're scaling
horizontally even if you're only buying single boxes: it's been a
long time since I've seen a server (or even a high-end desktop)
that doesn't sport at least four cores. Horizontal scalability is
tougher when you're scaling across racks of servers at a colocation
facility, but don't be deceived: that's how scalability works in
the 21st century, even on your laptop. Even in your cell phone. We
need database technologies that aren't just fast on single servers:
they must also scale across multiple servers.
Modern applications also need to be highly available. That goes
without saying, but think about how the meaning of "availability"
has changed over the years. Not much more than a decade ago, a web
application would have a single HTTP server that handed out static
files. These applications might be data-driven; but "data driven"
meant that a batch job rebuilt the web site overnight, and user
transactions were queued into a batch processing system, again for
processing overnight. Keeping such a system running isn't terribly
difficult. High availability doesn't impact the database. If the
database is only engaged in batched rebuilds or transaction
processing, the database can crash without damage. That's the world
for which relational databases were designed. In the '80s, if your
mainframe ran out of steam, you got a bigger one. If it crashed,
you were down. But when databases became a living, breathing part
of the application, availability became an issue. There is no way
to make a single system highly available; as soon as any component
fails, you're toast. Highly available systems are, by nature,
distributed systems.
If a distributed database is a given, the next question is how
much work a distributed system will require. There are
fundamentally two options: databases that have to be distributed
manually, via sharding; and databases that are inherently
distributed. Relational databases are split between multiple hosts
by manual sharding, or determining how to partition the datasets
based on some properties of the data itself: for example, first
names starting with A-K on one server, L-Z on another. A lot of
thought goes into designing a sharding and replication strategy
that doesn't impair performance, while keeping the data relatively
balanced between servers. There's a third option that is
essentially a hybrid: databases that are not inherently
distributed, but that are designed so they can be partitioned
easily. MongoDB is an example
of a database that can be sharded easily (or even automatically);
HBase, Riak, and Cassandra are all inherently
distributed, with options to control how replication and
distribution work.
What database choices are viable when you need good interactive
response? There are two separate issues: read latency and write
latency. For reasonably simple queries on a database with
well-designed indexes, almost any modern database can give decent
read latency, even at reasonably large scale. Similarly, just about
all modern databases claim to be able to keep up with writes at
high-speed. Most of these databases, including HBase, Cassandra,
Riak, and CouchDB, write
data immediately to an append-only file, which is an extremely
efficient operation. As a result, writes are often significantly
faster than reads.
Whether any particular database can deliver the performance you
need depends on the nature of the application, and whether you've
designed the application in a way that uses the database
efficiently: in particular, the structure of queries, more than the
structure[…]
february 2012 by rahuldave
Comprehensive notes from my three hour Redis tutorial
april 2010 by rahuldave
Last week I presented two talks at the inaugural NoSQL Europe conference in London. The first was presented with Matthew Wall and covered the ways in which we have been exploring NoSQL at the Guardian. The second was a three hour workshop on Redis, my favourite piece of software to have the NoSQL label applied to it.
I've written about Redis here before, and it has since earned a place next to MySQL/PostgreSQL and memcached as part of my default web application stack. Redis makes write-heavy features such as real-time statistics feasible for small applications, while effortlessly scaling up to handle larger projects as well. If you haven't tried it out yet, you're sorely missing out.
For the workshop, I tried to give an overview of each individual Redis feature along with detailed examples of real-world problems that the feature can help solve. I spent the past day annotating each slide with detailed notes, and I think the result makes a pretty good stand-alone tutorial. Here's the end result:
Redis tutorial slides and notes
In unrelated news, Nat and I both completed the first ever Brighton Marathon last weekend, in my case taking 4 hours, 55 minutes and 17 seconds. Sincere thanks to everyone who came out to support us - until the race I had never appreciated how important the support of the spectators is to keep going to the end. We raised £757 for the Have a Heart children's charity. Thanks in particular to Clearleft who kindly offered to match every donation.
brightonmarathon
guardian
marathon
nosql
redis
running
from google
I've written about Redis here before, and it has since earned a place next to MySQL/PostgreSQL and memcached as part of my default web application stack. Redis makes write-heavy features such as real-time statistics feasible for small applications, while effortlessly scaling up to handle larger projects as well. If you haven't tried it out yet, you're sorely missing out.
For the workshop, I tried to give an overview of each individual Redis feature along with detailed examples of real-world problems that the feature can help solve. I spent the past day annotating each slide with detailed notes, and I think the result makes a pretty good stand-alone tutorial. Here's the end result:
Redis tutorial slides and notes
In unrelated news, Nat and I both completed the first ever Brighton Marathon last weekend, in my case taking 4 hours, 55 minutes and 17 seconds. Sincere thanks to everyone who came out to support us - until the race I had never appreciated how important the support of the spectators is to keep going to the end. We raised £757 for the Have a Heart children's charity. Thanks in particular to Clearleft who kindly offered to match every donation.
april 2010 by rahuldave
Four short links: 17 March 2010
march 2010 by rahuldave
Common MySQL Queries -- a useful reference.
MySociety's Next 12 Months -- two new projects, FixMyTransport and "Project Fosbury". The latter is a more general tool to help people organise their own campaigns for change.
riak -- scalable key-value store with JSON interface. (via joshua on Delicious)
Notes from NoSQL Live Boston -- full of juicy nuggets of info from the NoSQL conference.
databases
events
gov20
mysociety
mysql
nosql
from google
MySociety's Next 12 Months -- two new projects, FixMyTransport and "Project Fosbury". The latter is a more general tool to help people organise their own campaigns for change.
riak -- scalable key-value store with JSON interface. (via joshua on Delicious)
Notes from NoSQL Live Boston -- full of juicy nuggets of info from the NoSQL conference.
march 2010 by rahuldave
Why Digg Digs Cassandra
march 2010 by rahuldave
Digg, the San Francisco-based social media company, is dropping MySQL and instead betting its future on Cassandra, an open-source data store. It’s just the latest sign of the growing popularity of the software, which was developed (and open sourced) by Facebook to search through its inbox. While Facebook has since backed off Cassandra, Digg plans to open source all its work on Cassandra and champion the software’s development and adoption.
In a blog post on the Digg blog, John Quinn, Digg’s VP of engineering, writes:
Perhaps our most significant infrastructure change is abandoning MySQL in favor of a NoSQL alternative. To someone like me who’s been building systems almost exclusively on relational databases for almost 20 years, this feels like a bold move.
What’s Wrong with MySQL?
Our primary motivation for moving away from MySQL is the increasing difficulty of building a high performance, write intensive, application on a data set that is growing quickly, with no end in sight. This growth has forced us into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead.
Digg is just the latest high-profile convert to the NoSQL world. Instead of using databases such as MySQL, many of the companies that deal in near-real-time information are opting for new kind of data stores — most of them open source, such as Cassandra and CouchDB.
Cassandra is roughly the open-source equivalent of Google’s Big Table. It was intended by Facebook to solve the problem of inbox search; the company needed something that was fast, reliable and had the ability to handle read and write requests at the same time. Messaging in an environment as heavily used as Facebook requires a system that can not only store data but also provide results for search queries at blazing fast speeds.
Stu Hood, the technical lead for the search team in the Email & Apps division of Rackspace, recently said:
I think that distributed databases solve a problem that a lot of companies with large datasets have had to solve independently in the past…Cassandra has an approach that hybridizes the Bigtable and Dynamo models, where a lot of its competitors chose to take one path or the other. Over the Bigtable clones, Cassandra has huge high-availability advantages, and no single point of failure (possible because of the eventually consistent approach). When compared to the Dynamo adherents, Cassandra has the advantage of a more advanced datamodel, allowing for a single “row” to contain billions of column/value pairs: enough to fill a machine. You also get efficient range queries for the top level key, and even within your values.
Data Presentations Cassandra Sigmod
View more presentations from jhammerb.
In a post last year, contributing writer Gary Orenstein pointed out that thanks to these attributes, Cassandra has potential applications beyond inbox search that include “recommendation engines, targeted advertising, and content search, particularly when you combine many concurrent inputs and output requests to the same data set.”
Digg is a prototypical application. The company tells me that it gets:
40 million visitors a month, who in turn account for roughly 500 million page views a month.
20,000 daily submissions
It also generates:
170,000 daily Diggs
19,000 comments
As these numbers suggest, there is a high amount of interaction between the system and its users. No wonder Digg digs Cassandra!
Related content from GigaOM Pro (sub req’d):
What Cloud Computing Can Learn From NoSQL.
@Not_for_Syndication
Cloud_Computing
Infrastructure
Om's_Posts
Cassandara
Digg
MySQL
NoSQL
from google
In a blog post on the Digg blog, John Quinn, Digg’s VP of engineering, writes:
Perhaps our most significant infrastructure change is abandoning MySQL in favor of a NoSQL alternative. To someone like me who’s been building systems almost exclusively on relational databases for almost 20 years, this feels like a bold move.
What’s Wrong with MySQL?
Our primary motivation for moving away from MySQL is the increasing difficulty of building a high performance, write intensive, application on a data set that is growing quickly, with no end in sight. This growth has forced us into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead.
Digg is just the latest high-profile convert to the NoSQL world. Instead of using databases such as MySQL, many of the companies that deal in near-real-time information are opting for new kind of data stores — most of them open source, such as Cassandra and CouchDB.
Cassandra is roughly the open-source equivalent of Google’s Big Table. It was intended by Facebook to solve the problem of inbox search; the company needed something that was fast, reliable and had the ability to handle read and write requests at the same time. Messaging in an environment as heavily used as Facebook requires a system that can not only store data but also provide results for search queries at blazing fast speeds.
Stu Hood, the technical lead for the search team in the Email & Apps division of Rackspace, recently said:
I think that distributed databases solve a problem that a lot of companies with large datasets have had to solve independently in the past…Cassandra has an approach that hybridizes the Bigtable and Dynamo models, where a lot of its competitors chose to take one path or the other. Over the Bigtable clones, Cassandra has huge high-availability advantages, and no single point of failure (possible because of the eventually consistent approach). When compared to the Dynamo adherents, Cassandra has the advantage of a more advanced datamodel, allowing for a single “row” to contain billions of column/value pairs: enough to fill a machine. You also get efficient range queries for the top level key, and even within your values.
Data Presentations Cassandra Sigmod
View more presentations from jhammerb.
In a post last year, contributing writer Gary Orenstein pointed out that thanks to these attributes, Cassandra has potential applications beyond inbox search that include “recommendation engines, targeted advertising, and content search, particularly when you combine many concurrent inputs and output requests to the same data set.”
Digg is a prototypical application. The company tells me that it gets:
40 million visitors a month, who in turn account for roughly 500 million page views a month.
20,000 daily submissions
It also generates:
170,000 daily Diggs
19,000 comments
As these numbers suggest, there is a high amount of interaction between the system and its users. No wonder Digg digs Cassandra!
Related content from GigaOM Pro (sub req’d):
What Cloud Computing Can Learn From NoSQL.
march 2010 by rahuldave
Copy this bookmark: