High Scalability - Google: Taming the Long Latency Tail - When More Machines Equals Worse Results


33 bookmarks. First posted by aaronmfraser march 2012.


Found this gem from a year ago: when more machines means worse results
from twitter
june 2013 by glenngillen
Interesting piece on how latency variability gets worse with scale - more machines equals worse results: http://t.co/zHzjcLwy
april 2012 by prenagha
Latency variability increases with scale.
performance  web  2ll 
april 2012 by jeffhammond
These weren't in the talk, but I found them interesting and related, so why not toss them in?
scale  server  latency  share 
march 2012 by bowbaq
"The implication: high performance equals high tolerances, which means your entire system must be designed to exacting standards."
devops  google  performance  architecture 
march 2012 by mncaudill
Imagine a client making a request of a single web server. Ninety-nine times out of a hundred that request will be returned within an acceptable period of time. But one time out of a hundred it may not; say the disk is slow for some reason. If you look at the distribution of latencies, most of them are small, but there's one out on the tail end that's large. That's not so bad, really. All it means is that one customer gets a slightly slower response every once in a while.

Let's change the example: now, instead of one server, you have 100 servers, and a request requires a response from all 100 of them. That changes everything about your system's responsiveness. Suddenly the majority of queries are slow. If each server is slow on just 1% of requests, the chance that all 100 respond quickly is 0.99^100 ≈ 37%, so roughly 63% of requests will take longer than one second (the sketch below bears the arithmetic out). That's bad.
google  performance  devops 
march 2012 by ook
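
The 63% figure is just the complement of every server being fast at once: 1 − 0.99^100 ≈ 0.634. Here is a minimal Monte Carlo sketch of that fan-out effect (my illustration, not code from the article; the 1 ms and 1 s latencies are assumed stand-ins for "fast" and "the disk is slow"):

    import random

    FAST_MS, SLOW_MS = 1.0, 1000.0  # assumed latencies: ~1 ms typical, ~1 s on the slow path
    P_SLOW = 0.01                   # each server is slow on 1 request in 100

    def server_latency():
        # A single server: fast ninety-nine times out of a hundred.
        return SLOW_MS if random.random() < P_SLOW else FAST_MS

    def fanout_latency(n=100):
        # A request waits on all n servers, so it is as slow as the slowest one.
        return max(server_latency() for _ in range(n))

    trials = 100_000
    slow = sum(fanout_latency() >= SLOW_MS for _ in range(trials))
    print(f"requests gated by a slow server: {slow / trials:.1%}")  # ~63.4%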
Likewise the current belief that, in the case of artificial machines the very large and the very small are equally feasible and lasting is a manifest error. Thus, for example, a small obelisk or column or other solid figure can certainly be laid down or set up without danger of breaking, while the large ones will go to pieces under the slightest provocation, and that purely on account of their own weight. -- Galileo
Galileo observed how things broke if they were naively scaled up. Interestingly, Google noticed a similar pattern when building larger software systems using the same techniques used to build smaller systems. 

Luiz André Barroso, Distinguished Engineer at Google, talks about this fundamental property of scaling systems in his fascinating talk, Warehouse-Scale Computing: Entering the Teenage Decade. Google found that the larger the scale, the greater the impact of latency variability. When a request is implemented by work done in parallel, as is common in today's service-oriented systems, the overall response time is dominated by the long-tail distribution of the parallel operations. Every response must have consistent, low latency or the overall operation's response time will be tragically slow. The implication: high performance equals high tolerances, which means your entire system must be designed to exacting standards.

What is forcing a deeper look into latency variability is the advent of interactive real-time computing, where responsiveness is key. Good average response times aren't good enough: you simply can't take the techniques used to build smaller systems and naively scale them up. The reason is surprising and has deep implications for how we design service-dominated systems; the sketch below makes the arithmetic concrete.
Latency  Strategy  google  Architecture  Scalability  from google
march 2012 by aaronmfraser
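
A back-of-the-envelope generalization of the numbers above (my sketch, not from the talk): if each of N parallel calls independently exceeds the latency budget with probability q, the whole fan-out request exceeds it with probability 1 − (1 − q)^N, which races toward certainty as N grows:

    # Tail amplification versus fan-out, assuming independent servers
    # that each exceed the latency budget with probability q = 0.01.
    q = 0.01
    for n in (1, 10, 100, 1000):
        print(f"fan-out {n:4d}: P(slow) = {1 - (1 - q) ** n:.3f}")
    # fan-out    1: P(slow) = 0.010
    # fan-out   10: P(slow) = 0.096
    # fan-out  100: P(slow) = 0.634
    # fan-out 1000: P(slow) = 1.000

This is why every response must have consistent, low latency: at warehouse scale the tail is no longer the exception but the expected case.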