jm + ops + codel   1

How Facebook avoids failures
Great paper from Ben Maurer of Facebook in ACM Queue.
A "move-fast" mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook's infrastructure provides safety valves.


This is full of interesting techniques.

* Rapidly deployed configuration changes: Make everybody use a common configuration system; Statically validate configuration changes; Run a canary; Hold on to good configurations; Make it easy to revert.

* Hard dependencies on core services: Cache data from core services. Provide hardened APIs. Run fire drills.

* Increased latency and resource exhaustion: Controlled Delay (based on the anti-bufferbloat CoDel algorithm -- this is really cool); Adaptive LIFO (last-in, first-out) for queue busting; Concurrency Control (essentially a form of circuit breaker).

* Tools that Help Diagnose Failures: High-Density Dashboards with Cubism (horizon charts); What just changed?

* Learning from Failure: the DERP (!) methodology,
ben-maurer  facebook  reliability  algorithms  codel  circuit-breakers  derp  failure  ops  cubism  horizon-charts  charts  dependencies  soa  microservices  uptime  deployment  configuration  change-management 
november 2015 by jm

Copy this bookmark:



description:


tags: