cloudseer + shared + web_development   3

Building Scalable Databases: Are Relational Databases Compatible with Large Scale Websites?
A few weeks ago Todd Hoff over on the High Scalability blog penned a blog post titled MySQL
and Memcached: End of an Era? where he wrote



If you look at the early days of this blog, when web scalability was still in
its heady bloom of youth, many of the articles had to do with leveraging MySQL and memcached.
Exciting times. Shard MySQL to handle high write loads, cache objects in memcached
to handle high read loads, and then write a lot of glue code to make it all work together.
That was state of the art, that was how it was done. The architecture of many major
sites still follow this pattern today, largely because with enough elbow grease, it
works.



With a little perspective, it's clear the MySQL+memcached era is passing.



LinkedIn has moved on with their Project
Voldemort. Amazon went there a
while ago.

Digg declared their entrance into a new era in a post on their blog titled Looking
to the future with Cassandra,



Twitter has also declared their move in the article Cassandra
@ Twitter: An Interview with Ryan King.



Todd’s blog has been a useful source of information on the topic of scaling large
scale websites since he catalogues as many presentations as he can find from industry
leaders on how they’ve designed their systems to deal with millions to hundreds of
millions of users pounding their services a day. What he’s written above is really
an observation about industry trends and isn’t really meant to attack any technology.
I did find it interesting that many took it as an attack on memcached and/or relational
databases and came out swinging.

One post which I thought tried to take a balanced approach to rebuttal was Dennis
Forbes’ Getting
Real about NoSQL and the SQL-Isn't-Scalable Lie where he writes



I work in the financial industry. RDBMS’ and the Structured Query Language (SQL)
can be found at the nucleus of most of our solutions. The same was true when I worked
in the insurance, telecommunication, and power generation industries. So it piqued
my interest when a peer recently forwarded an article titled “The
end of SQL and relational databases”, adding the subject line “We’re
living in the past”. [Though as Michael Stonebraker points out, SQL
the query language actually has remarkably little to actually to do with the debate.
It would be more clearly called NoACID]



From a vertical scaling perspective — it’s the easiest and often the most computationally
effective way to scale (albeit being very inefficient from a cost perspective) — you
have the capacity to deploy your solution on >powerful
systems with armies of powerful cores, hundreds of GBs of memory, operating
against SAN arrays with ranks and ranks of SSDs.

The computational and I/O capacity possible on a single “machine” are positively
enormous. The storage system, which is the biggest limiting factor on most database
platforms, is ridiculously scalable, especially in the bold
new world of SSDs (or flash cards like the FusionIO).



From a horizontal scaling perspective you can partition
the data across many machines, ideally configuring each machine in a
failover cluster so you have complete redundancy and availability. With Oracle RAC
and Sybase ASE you can even add the classic clustering approach. Such a solution —
even on a stodgy old RDBMS — is scalable far beyond any real world need because you’ve
built a system for a large corporation, deployed in your own datacenter, with few
constraints beyond the limits of technology and the platform.

Your solution will cost hundreds of thousands of dollars (if not millions) to
deploy, but that isn’t a critical blocking point for most enterprises.This sort of
scaling that is at the heart of virtually every bank, trading system, energy platform,
retailing system, and so on.

To claim that SQL systems don’t scale, in defiance of such obvious and overwhelming
evidence, defies all reason.



There’s lots of good for food for thought in both blog posts. Todd is right that a
few large scale websites are moving beyond the horizontal scaling approach that Dennis
brought up in his rebuttal based on their experiences. What tends to happen once you’ve
built a partitioned/sharded SQL database architecture is that you tend to notice that
you’ve given up most of the features of an ACID relational database. You give up the
advantages of the relationships by eschewing foreign keys, triggers and joins since
these are prohibitively expensive to run across multiple databases. Denormalizing
the data means that you give up on Atomicity, Consistency and Isolation when updating
or retrieving results. And the end all you have left is that your data is Durable
(i.e. it is persistently stored) which isn’t much better than you get from a dumb
file system. Well, actually you also get to use SQL as your programming model which
is nicer than performing direct file I/O operations.

It is unsurprising that after being at this point for years, some people in our industry
have wondered whether it doesn’t make more sense to use data stores that are optimized
for the usage patterns of large scale websites instead of gloriously misusing relational
databases.  A good example of the tradeoffs is the blog post from the Digg team
on why they switched
to Cassandra. The database was already sharded which made performing joins to
calculate results of queries such as “which of my friends Dugg this item?” to be infeasible.
So instead they had to perform two reads from SQL (all Diggs on an item and all of
the user’s friends) then perform the intersection operation on the PHP front end code.
If the item was not already cached, this leads to disk I/O which could take seconds.
To make the situation worse, you actually want to perform this operation multiple
times on a single page view since it is reasonable to expect multiple Digg buttons
on a page if it has multiple stories on it.

An alternate approach is to denormalize the data and for each user store a list of
stories that have been Dugg by at least one of their friends. So whenever I Digg an
item, an entry is placed in each of my friends’ lists to indicate that story is now
one that has been Dugg by a friend. That way when the a friend of mine shows up, it
is a simple lookup to say “is this story ID on the list of stories Dugg by one of
their friends?” The challenge here is that it means Digging an item can result in
literally thousands of logical write operations. It has been traditionally prohibitively
expensive to incur such massive amounts of write I/O in relational databases
with all of their transactionality and enforcing of ACID constraints. NoSQL databases
like Cassandra which assume your data is denormalized are actually optimized for write
I/O heavy operations given the necessity of having to perform enormous amounts of
writes to keep data consistent.

Digg’s usage of Cassandra actually serves as a rebuttal to Dennis Forbes’ article
since they couldn’t feasibly get what they want with either horizontal or vertical
scaling of their relational database-based solution. I would argue that introducing
memcached into the mix would have addressed disk I/O concerns because all records
of who has Dugg an item could be stored in-memory so comparisons of which of my friends
have Dugg an item never have to go to disk to answer any parts of the query. The only
caveat with that approach is that RAM is more expensive than disk so you’ll need a
lot more servers to store 3 terabytes of data in memory than you would on disk.

However, the programming model is not the only factor one most consider when deciding
whether to stay with a sharded/partitioned relational database versus going with a
NoSQL solution. The other factor to consider is the actual management of the database
servers. The sorts of questions one has to ask when choosing a database solution are
listed in the interview
with Ryan King of Twitter where he lists the following checklist that they evaluated
before deciding to go with Cassandra over MySQL



We first evaluated them on their architectures by asking many questions along
the lines of:

How will we add new machines?

Are their any single points of failure?

Do the writes scale as well?

How much administration will the system require?

If its open source, is there a healthy community?

How much time and effort would we have to expend to deploy and integrate it?

Does it use technology which we know we can work with?



The problem with database sharding is that it isn’t really a supported out of the
box configuration for your traditional relational database product especially the
open source ones. How your system deals with new machines being added to the cluster
or handles machine failure often requires special case code being written by application
developers along with special hand holding by operations teams. Dealing with issues
related to database replication (whether it is multi-master or single
master) also often takes up unexpected amounts of manpower once sharding is involved.

For these reasons I expect we’ll see more large scale websites decide that instead
of treating a SQL database as a denormalized key-value pair store that they would
rather use a NoSQL database. However I also suspect that a lot of services who already
have a sharded relational database + in-memory cache solution can get a lot of mileage
from more judicious usage of in-memory caches before switching. This is especially
true given that you still caches in front of your NoSQL databases anyway. There’s
also the question of whether traditional relational database vendors will add features
to address the shortcomings highlighted by the NoSQL movement? Given that the sort
of companies adopting NoSQL are doing so because they want to save costs on software,
hardware and operations I somehow doubt that there is a lucrative marke[…]
Web_Development  shared  from google
march 2010 by cloudseer
PHP Creator Asks, Is Facebook's HipHop Just a "Nifty Trick"?
Earlier this week, we pinged PHP creator Rasmus Lerdorf about the implications of HipHop, the new PHP runtime that Facebook just open-sourced.

Using ReadWriteWeb as an example, he goes into great detail about how speeding up PHP isn't simply a matter of finding a single magic bullet. Lerdorf contends that true optimization comes down to attacking the mundane inefficiencies that sneak into sites of all sizes - even suggesting free tools for PHP devs to use - before a solution such as HipHop is considered.

Sponsor

Additional reporting by RWW journalist Mike Melanson.

But first, for an overview of what HipHop is (Facebook's rewrite of the PHP runtime) and what it does (translates PHP to C++ and compiles it with g++), check out our growing archive of HipHop posts or watch this video from Facebook:

Lerdorf begins his assessment of the project on a positive note. "I think it is a cool project and it will certainly be a good option for some sites," he wrote to us in an email.

"The effectiveness is going to depend a lot on the type of code it is used on. Like similar projects such as Psyco, Cython, Pypy and ShedSkin in the Python world, results vary greatly."

However, he continued to say that raw execution speed was "not a significant factor" for many applications. "Even if you double the execution speed of something that is 10% of your overall request cost, that is only a 5% overall improvement. If on every request you are hitting memcache/postgresql/mysql 10 times and spending a lot of time in system calls, don't expect miracles from HipHop."

Calling the code translator "a nifty trick," Lerdorf worries that some developers will see HipHop as "some kind of magic bullet" for site performance. Noting the amount of hype he's seen lately about the new runtime, Lerdorf wrote, "I'd love to see those same people get excited about basic profiling and identifying the most costly areas of an application. Speeding up one of the faster parts of your system isn't going to give you anywhere near as much of a benefit as speeding up, or eliminating, one of the slower parts of your overall system.

"People generally don't bother doing even the most basic site optimizations. PHP execution speed is usually quite far down the list if you sit down and profile the end-to-end series of HTTP requests that lead to the final page in a broswer..." At that point, Lerdorf was kind enough to give us a run-down of site issues with ReadWriteWeb.com that we dutifully passed on to our new and often beleaguered webmaster.

"These are relatively minor and very common things, but they all contribute to your users' perceived performance of your site. And that's just the frontend. If I sat down with a profiler and looked at your backend I predict I would find similar inefficiencies."

In a word, Lerdorf is telling us all to not get too excited about the "nifty trick" that is HipHop until the runtime is the biggest thing that's stopping a site from being as fast as possible. Given Facebook's engineering focus, we can imagine that the site's been pretty well optimized and HipHop gives enough efficiency to give users a faster front end while also using fewer server resources.

Lerdorf ended by suggesting Yahoo's YSlow and Google's Page Speed for analyzing front-end issues, and he recommended Valgrind's Callgrind for low-level back-end profiling and XDebug for userspace PHP profiling.

Discuss
Web_Development  shared  from google
february 2010 by cloudseer
Building Scalable Databases: Perspectives on the War on Soft Deletes
In the past few months I've noticed an increased number of posts questioning practices
around deleting and "virtually" deleting data from databases. Since some
of the concerns around this practice have to do with the impact of soft deletes on
scalability of a database-based application, I thought it would be a good topic for
my ongoing series on building scalable databases.

Soft Deletes 101: What is a soft delete and how does it differ from a hard delete?

Soft deleting an item from a database means that the row or entity is marked as deleted
but not physically removed from the database. Instead it is hidden from normal users
of the system but may be accessible by database or system administrators.

For example, let's consider this sample database of XBox 360 games I own

Name

Category

ESRB

GamespotScore

Company

Call of Duty: Modern Warfare 2

First Person Shooter

Mature

9.0

Infinity Ward

Batman: Arkham Asylum

Fantasy Action Adventure

Teen

9.0

Rocksteady Studios

Gears of War 2

Sci-Fi Shooter

Mature

9.0

Epic Games

Call of Duty 4: Modern Warfare

First Person Shooter

Mature

9.0

Infinity Ward

Soul Calibur IV

3D Fighting

Teen

8.5

Namco

Now consider what happens if I decide that I'm done with Call of Duty 4: Modern Warfare
now that I own Call of Duty: Modern Warfare 2. The expected thing to do would then
be to remove the entry from my database using a query such as



DELETE FROM games WHERE name='Call of Duty 4: Modern Warfare';



This is what is considered a "hard" delete.

But then what happens if my friends decide to use my list of games to decide which
games to get me for Christmas? A friend might not realize I'd previously owned the
game and might get it for me again. Thus it might be preferable if instead of deleting
items from the database they were removed from consideration as games I currently
own but still could be retrieved in special situations. To address this scenario I'd
add an IsDeleted column as shown below

Name

Category

ESRB

GamespotScore

Company

IsDeleted

Call of Duty: Modern Warfare 2

First Person Shooter

Mature

9.0

Infinity Ward

False

Batman: Arkham Asylum

Fantasy Action Adventure

Teen

9.0

Rocksteady Studios

False

Gears of War 2

Sci-Fi Shooter

Mature

9.0

Epic Games

False

Call of Duty 4: Modern Warfare

First Person Shooter

Mature

9.0

Infinity Ward

True

Soul Calibur IV

3D Fighting

Teen

8.5

Namco

False

Then for typical uses an application would interact with the following view of the
underlying table



CREATE VIEW current_games AS

SELECT Name, Category, ESRB, GameSpotScore, Company FROM games WHERE IsDeleted=False;



but when my friends ask me for a list of all of the games I have, I can provide the
full list of all the games I've ever owned from the original games table
if needed. Now that we understand how one would use soft deletes we can discuss the
arguments against this practice.

Rationale for War: The argument against soft deletes

Ayende Rahien makes a cogent argument against soft deletes in his post Avoid
Soft Deletes where he writes



One of the annoyances that we have to deal when building enterprise applications
is the requirement that no data shall be lost. The usual response to that is to introduce
a WasDeleted or an IsActive column in the database and implement deletes as an update
that would set that flag.

Simple, easy to understand, quick to implement and explain.

It is also, quite often, wrong.

The problem is that deletion of a row or an entity is rarely a simple event. It
effect not only the data in the model, but also the shape of the model. That is why
we have foreign keys, to ensure that we don’t end up with Order Lines that don’t have
a parent Order. And that is just the simplest of issues.

...

Let us say that we want to delete an order. What should we do? That is a business
decision, actually. But it is one that is enforced by the DB itself, keeping the data
integrity.

When we are dealing with soft deletes, it is easy to get into situations where
we have, for all intents and purposes, corrupt data, because Customer’s LastOrder
(which is just a tiny optimization that no one thought about) now points to a soft
deleted order.



Ayende is right that adding an IsDeleted flag mean that you can no longer take advantage
of database triggers for use when cleaning up database state when a deletion occurs.
This sort of cleanup now has to moved up into the application layer.

There is another set of arguments against soft deletes in Richard Dingwall's post
entitled The
Trouble with Soft Delete where he points out the following problems


Complexity

To prevent mixing active and inactive data in results, all queries must be made
aware of the soft delete columns so they can explicitly exclude them. It’s like a
tax; a mandatory WHERE clause to ensure you don’t return any deleted rows.

This extra WHERE clause is similar to checking return codes in programming languages
that don’t throw exceptions (like C). It’s very simple to do, but if you forget to
do it in even one place, bugs can creep in very fast. And it is background noise that
detracts away from the real intention of the query.

Performance

At first glance you might think evaluating soft delete columns in every query
would have a noticeable impact on performance. However, I’ve found that most RDBMSs
are actually pretty good at recognizing soft delete columns (probably because they
are so commonly used) and does a good job at optimizing queries that use them. In
practice, filtering inactive rows doesn’t cost too much in itself.

Instead, the performance hit comes simply from the volume of data that builds
up when you don’t bother clearing old rows. For example, we have a table in a system
at work that records an organisations day-to-day tasks: pending, planned, and completed.
It has around five million rows in total, but of that, only a very small percentage
(2%) are still active and interesting to the application. The rest are all historical;
rarely used and kept only to maintain foreign key integrity and for reporting purposes.

Interestingly, the biggest problem we have with this table is not slow read performance
but writes. Due to its high use, we index the table heavily to improve query performance.
But with the number of rows in the table, it takes so long to update these indexes
that the application frequently times out waiting for DML commands to finish.



These arguments seem less valid than Ayende's especially when the alternatives proposed
are evaluated. Let's look at the aforementioned problems and the proposed alternatives
in turn.

Trading the devil you know for the devil you don't: Thoughts on the alternatives
to soft deletes

Richard Dingwall argues that soft deletes add unnecessary complexity to the system
since all queries have to be aware of the IsDeleted column(s) in the database. As
I mentioned in my initial description of soft deletes this definitely does not have
to be the case. The database administrator can create views which the core application
logic interacts with (i.e. the current_games table in my example) so
that only a small subset of system procedures need to actually know that the soft
deleted columns even still exist in the database.

A database becoming so large that data manipulation becomes slow due to having to
update indexes is a valid problem. However Richard Dingwall's suggested alternative
excerpted below seems to trade one problem for a worse one


The memento pattern

Soft delete only supports undoing deletes, but the memento
pattern provides a standard means of handling all undo scenarios your
application might require.

It works by taking a snapshot of an item just before a change is made, and putting
it aside in a separate store, in case a user wants to restore or rollback later. For
example, in a job board application, you might have two tables: one transactional
for live jobs, and an undo log that stores snapshots of jobs at previous points in
time:



The problem I have with this solution is that if your database is already grinding
to a halt simply because you track which items are active/inactive in your database,
how much worse would the situation be if you now store every state transition in the
database as well? Sounds like you're trading one performance problem for a much worse
one.

The real problem seems to be that the database has gotten too big to be operated on
in an efficient manner on a single machine. The best way to address this is to partition
or shard the database. In fact, you could even choose to store all inactive records
on one database server and all active records on another. Those interested in database
sharding can take a look at a more
detailed discussion on database sharding I wrote earlier this year.

Another alternative proposed by both Ayende Rahien and Richard Dingwall is to delete
the data but use database triggers to write to an audit log in the cases where auditing
is the primary use case for keeping soft deleted entries in the database. This works
in the cases where the only reason for soft deleting entries is for
auditing purposes. However there are many real world situations where this is not
the case.

One use case for soft deleting is to provide an "undo" feature in an end
user application. For example, consider a user synchronizes the contact list on their
phone with one in the cloud (e.g. an iPhone or Windows Mobile/Windows Phone connecting
to Exchange or an Android phone connecting to Gmail). Imagine that the user now deletes
a contact from their phone because they do not have a phone number for the person
only to find out that person has also been deleted from their address book in the
cloud. At that point, an undo feature is desirable.

Other use cases could be the need to reactivate items that have been removed from
th[…]
Web_Development  shared  from google
november 2009 by cloudseer

Copy this bookmark:



description:


tags: