jm + on-call   3

Julia Evans on Twitter: "notes on this great "When the pager goes off" article"
'notes on this great "When the pager goes off" article from @incrementmag' -- a cartoon summarising a much longer article on common modern ops on-call response techniques. Still pretty consistent with the systems we used in Amazon.
on-call  ops  incident-response  julia-evans  pager  increment-mag 
april 2017 by jm
PagerDuty Incident Response Documentation
This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process).

This is a really good set of processes -- quite similar to what we used in Amazon for high-severity outage response.
ops  process  outages  pagerduty  incident-response  incidents  on-call 
january 2017 by jm
Post-mortem for February 24th, 2010 outage - Google App Engine
extremely detailed; power outage in the primary DC resulted in a degraded fleet, and on-calls didn't have up-to-date on-call docs to respond correctly
google  gae  appengine  outages  post-mortems  multi-dc  reliability  distcomp  fleets  on-call  from delicious
march 2010 by jm
