Exploring Complexity: We Need to Talk About Scaling (Melanie Mitchell)
december 2011 by arsyed
"In my next several blog posts I want to talk about scaling, especially about the very recent controversies surrounding claims of power-law scaling of particular phenomena [...] All this is going to require some forays into the wild and unruly land of statistics and data analysis. My goal in the next series of posts is to make sense of the following quite important papers in complex systems, which, taken together, form a kind of mini-course on scaling. Understanding ideas from these papers is essential in one’s education as a complex-systems scientist or informed “consumer” of this field."
complexity
scaling
power-law
via:cshalizi
GraphLab: A New Parallel Framework for Machine Learning
june 2011 by arsyed
"Existing high-level parallel abstractions like MapReduce are often insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance."
machine-learning
parallel
scaling
A Nice Introduction to Logistic Regression (Yi Wang)
april 2011 by arsyed
"A C++ implementation of large-scale logistic regression (together with a tech-report) can be found at:
http://stat.rutgers.edu/~madigan/BBR
A Mahout slide deck shows that they have received a proposal to implement logistic regression in Hadoop from Google Summer of Code, but I have not seen the result yet.
Two papers on large-scale logistic regression were published in 2009:
1. Parallel Large-scale Feature Selection for Logistic Regression, and
2. Large-scale Sparse Logistic Regression"
statistics
statcomp
scaling
logistic-regression
Anatomy of a Crushing (Pinboard Blog)
march 2011 by arsyed
"And a final, special shout-out goes to my favorite company in the world, Yahoo. I can't wait to see what you guys think of next!"
pinboard
delicious
architecture
scaling
postmorten
via:jacobian
Does StackOverflow use caching and if so, how? - Meta Stack Overflow
january 2011 by arsyed
"In our (admittedly limited) experience, Redis is so fast that the slowest part of a cache lookup is the time spent reading and writing bytes to the network. This is not surprising, really, if you think about it."
stackOverflow
architecture
caching
redis
scaling
Let the microblogs bloom (Russell Beattie)
december 2010 by arsyed
"Here's how a microblog system has to work to scale: All the messages created by users have to go into a Queue when they're created, and an external process then has to go through one by one and figure out which messages go into which subscriber's message list. As the system grows and more messages are created, the messages may arrive in your "inbox" slower, but they will still arrive. This type of system can be easily broken up into dedicated servers and multiple processes can handle different parts of the read/write process, and the individual user message lists can be more easily cached - as once a page is created that contains messages, it doesn't change."
architecture
microblogging
twitter
scaling
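The queue-then-deliver design described in the Beattie quote above can be sketched roughly as follows. This is only an illustration under assumptions: the class name `Microblog` and the methods `follow`, `post`, and `deliver_pending` are hypothetical, everything lives in one process, and in-memory structures stand in for the dedicated servers and external worker processes the quote has in mind.

```python
from collections import defaultdict, deque


class Microblog:
    """Toy model of write-time fan-out through a queue (hypothetical names)."""

    def __init__(self):
        self.followers = defaultdict(set)  # author -> set of subscribers
        self.pending = deque()             # queue of (author, text) not yet delivered
        self.inboxes = defaultdict(list)   # subscriber -> delivered messages

    def follow(self, subscriber, author):
        self.followers[author].add(subscriber)

    def post(self, author, text):
        # Posting only enqueues; no subscriber list is touched here.
        self.pending.append((author, text))

    def deliver_pending(self, limit=None):
        """Drain up to `limit` queued messages into subscriber inboxes.

        Stands in for the external process in the quote: under load the
        queue grows and delivery gets slower, but nothing is dropped.
        Returns the number of messages taken off the queue.
        """
        processed = 0
        while self.pending and (limit is None or processed < limit):
            author, text = self.pending.popleft()
            for subscriber in self.followers.get(author, ()):
                self.inboxes[subscriber].append((author, text))
            processed += 1
        return processed


blog = Microblog()
blog.follow("bob", "alice")
blog.follow("carol", "alice")
blog.post("alice", "hello")
blog.post("alice", "world")

# Nothing is visible until the worker runs.
before = list(blog.inboxes["bob"])
processed_first = blog.deliver_pending(limit=1)
after_first = list(blog.inboxes["bob"])
processed_rest = blog.deliver_pending()
```

Because each inbox is a precomputed list, reading it requires no join across authors, which is what makes the per-user lists easy to cache in the way the quote suggests.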
s4: distributed stream computing platform
november 2010 by arsyed
"S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data."
software
yahoo
stream-processing
scaling
The problems with ACID, and how to fix them without going NoSQL (Daniel Abadi, Alexander Thomson)
september 2010 by arsyed
"In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues. Responsibility for atomicity, consistency and isolation is simply being pushed onto the developer. ... the problem with ACID is not that its guarantees are too strong (and that therefore scaling these guarantees in a shared-nothing cluster of machines is too hard), but rather that its guarantees are too weak, and that this weakness is hindering scalability."
database
scaling
distributed
transactions
acid
isolation
deterministic
papers
A Retrospective on SEDA (Matt Welsh)
july 2010 by arsyed
"If I were to design SEDA today, I would decouple stages (i.e., code modules) from queues and thread pools (i.e., concurrency boundaries)." ... "The most important contribution of SEDA, I think, was the fact that we made load and resource bottlenecks explicit in the application programming model."
server
swarch
scaling
seda
event-driven
concurrency
All Velocity conference 2010 Slides/Notes (Royans Tharakan)
june 2010 by arsyed
"Here are all the slides/PDFs which I’ve come across from the first 2 days at velocity"
talks
videos
velocity
conference
scaling
Problems with CAP, and Yahoo’s little known NoSQL system (Daniel Abadi)
may 2010 by arsyed
"In thinking about CAP the past few weeks, I feel that it has become overrated as a tool for explaining the design of modern scalable, distributed systems. Not only is the asymmetry of the contributions of C, A, and P confusing, but the lack of latency considerations in CAP significantly reduces its utility. To me, CAP should really be PACELC --- if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?"
database
distcomp
cap
latency
scaling
Why Events Are A Bad Idea (for High-concurrency Servers) (Rob von Behren, Jeremy Condit, and Eric Brewer)
march 2010 by arsyed
"Event-based programming has been highly touted in recent years as the best way to write highly concurrent applications. Having worked on several of these systems, we now believe this approach to be a mistake. Specifically, we believe that threads can achieve all of the strengths of events, including support for high concurrency, low overhead, and a simple concurrency model. Moreover, we argue that threads allow a simpler and more natural programming style."
papers
concurrency
threading
scaling
events
via:shivak
Server Design (Jeff Darcy)
june 2009 by arsyed
"The rest of this article is going to be centered around what I’ll call the Four Horsemen of Poor Performance: 1. Data copies 2. Context switches 3. Memory allocation 4. Lock contention"
programming
swarch
scaling
performance
bottlenecks
Queue everything and delight everyone (l.m. orchard)
july 2008 by arsyed
"The idea here is that the social structure can help you scale, while still delighting people."
architecture
queueing
scaling
twitter
microblogging