pfctdayelise + distributed   15

Operating a Large, Distributed System in a Reliable Way: Practices I Learned
Oncall, Anomaly Detection & Alerting
Outages & Incident Management Processes
Postmortems, Incident Reviews & a Culture of Ongoing Improvements
Failover Drills, Capacity Planning & Blackbox Testing
SLOs, SLAs & Reporting on Them
SRE as an Independent Team
Reliability as an Ongoing Investment
Further Recommended Reading
devops  monitoring  metrics  softwaredev  distributed 
july 2019 by pfctdayelise
Pactflow | Distributed systems testing made easy
The first contract-testing platform for collaborating on and testing distributed systems
testing  distributed  pact  contracttesting 
may 2019 by pfctdayelise
Patterns of resilience
In this slide deck, I first describe what resilience is, why it is important, and how it differs from traditional stability approaches.

After that introduction, the main part is a "small" pattern language organized around isolation, the typical starting point of resilient software design. I use quotation marks around "small" because even this subset of a complete resilience pattern language still consists of around 20 patterns.

All the patterns are briefly described, and for some I added a bit of detail, but as this is a slide deck, the voice track is, as usual, missing. The pattern language is also still a work in progress, i.e., it has not yet settled and some details are missing. Still, I think (or at least hope) that the slides contain a few useful insights for you.
resiliency  softwaredev  architecture  distributed 
may 2019 by pfctdayelise
Notes on Distributed Systems for Young Bloods – Something Similar
# Distributed systems are different because they fail often.
# Writing robust distributed systems costs more than writing robust single-machine systems.
# Robust, open source distributed systems are much less common than robust, single-machine systems.
# Coordination is very hard.
# If you can fit your problem in memory, it’s probably trivial.
# “It’s slow” is the hardest problem you’ll ever debug.
# Implement backpressure throughout your system.
# Find ways to be partially available.
# Metrics are the only way to get your job done.
# Use percentiles, not averages.
# Learn to estimate your capacity.
# Feature flags are how infrastructure is rolled out.
# Choose id spaces wisely.
# Exploit data-locality.
# Writing cached data back to persistent storage is bad.
# Computers can do more than you think they can.
# Use the CAP theorem to critique systems.
# Extract services.
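The "use percentiles, not averages" point is easy to see with a toy example: a single slow request skews the mean badly, while the median and tail percentiles describe what users actually experience. A minimal sketch using Python's standard library, with made-up latency numbers:

```python
import statistics

# Hypothetical latency samples in milliseconds: mostly fast, one slow outlier.
latencies_ms = [10, 11, 12, 10, 11, 13, 12, 11, 10, 950]

mean = statistics.mean(latencies_ms)  # dragged upward by the single outlier
p50 = statistics.median(latencies_ms)  # typical request, unaffected by the outlier
# quantiles(n=100) returns the 99 percentile cut points; index 98 is p99.
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # the tail, dominated by the outlier

print(f"mean={mean:.1f}ms p50={p50:.1f}ms p99={p99:.1f}ms")
# Here mean=105.0ms while p50=11.0ms: the average suggests every request
# is slow, when in fact 9 out of 10 are fast and one is pathological.
```

Reporting p50/p95/p99 separately keeps the common case and the tail from being blurred into one misleading number.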
march 2017 by pfctdayelise