
The Time Our Provider Screwed Us
Good talk (with transcript) from Paul Biggar about what happened when CircleCI had a massive security incident, and how Jesse Robbins helped them do incident response correctly.

'On the left, Jesse pointed out that we needed an incident commander. That’s me, Paul. And this is very good, because I was a big proponent, I think lots of us were around the 2013 mark, of flat organizational structures, and so I hadn’t really got a handle on this whole being in charge thing. The fact that someone else came in and said, “No, no, no, you are in charge”: extremely useful. And he also laid out the order of our priorities. Number one priority: safety of customers. Number two priority: communicate with customers. Number three priority: recovery of service.

I think a reasonable person could have put those in a different order, especially under the pressure and time constraints of the potential company-ending situation. So I was very happy to have those in order. If this is ever going to happen to you, I’d memorize them, maybe put them on an index card in your pocket, in case this ever happens.

The last thing he said was to make sure that we log everything, that we go slow, and that we code review and communicate. His point there is that if we’re going to bring our site back up, if we’re going to do all the things that we need to do in order to save our business and do the right thing for our customers and all that, we can’t be making quick, bad decisions. You can’t just upload whatever code is on your computer now, because I have to do this now, I have to fix it. So we set up a Slack channel … This was pre-Slack; it was a HipChat channel, where all of our communications went. Every single communication that we had about this went in that chatroom. Which came in extremely useful the next day, when I had to write a blog post that detailed exactly what had happened and all the steps that we did to fix it and remediate this, and I had exact time stamps of all the things that had happened.'
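
The "log everything, with timestamps, in one place" advice is the easiest part to mechanise. A rough sketch of what that could look like (Python; the filename and the example actions are invented for illustration, not anything CircleCI actually ran):

#!/usr/bin/env python3
"""Minimal incident-log helper -- a sketch only, not CircleCI's tooling.

The idea from the talk: every action taken during the incident goes into one
timestamped, append-only record, so the postmortem timeline writes itself.
"""
from datetime import datetime, timezone

LOGFILE = "incident-log.txt"   # hypothetical filename


def log_action(actor: str, action: str) -> None:
    """Append a timestamped entry; the same text would also go to the incident chat channel."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"{stamp}  {actor}: {action}"
    with open(LOGFILE, "a") as f:
        f.write(line + "\n")
    print(line)


if __name__ == "__main__":
    log_action("paul", "rotated exposed credentials")
    log_action("jesse", "confirmed priority order: safety, communication, recovery")

The point is less the tooling than the habit: one append-only record means the next-day blog post is mostly a matter of editing the log.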
incidents  incident-response  paul-biggar  circleci  security  communication  outages 
21 days ago by jm
GitLab.com Database Incident - 2017/01/31
Horrible, horrible postmortem doc. This is the kicker:
So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.


Reddit comments: https://www.reddit.com/r/linux/comments/5rd9em/gitlab_is_down_notes_on_the_incident_and_why_you/
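
The underlying lesson: an unverified backup is not a backup. A minimal sketch of the sort of automated sanity check that surfaces this before you need the backup (Python; the path, age and size thresholds are invented for illustration, and a real check would also do a periodic test restore):

#!/usr/bin/env python3
"""Crude backup sanity check -- a sketch of the lesson, not GitLab's setup.

Checks that the newest dump exists, is recent, and isn't suspiciously small;
wire the exit code into monitoring so a silent backup failure pages someone.
"""
import glob
import os
import sys
import time
from typing import Optional

BACKUP_DIR = "/var/backups/postgres"   # hypothetical location
MAX_AGE_HOURS = 26                     # daily dump, plus some slack
MIN_SIZE_BYTES = 10 * 1024 * 1024      # anything smaller is probably an empty dump


def newest_backup(path: str) -> Optional[str]:
    dumps = glob.glob(os.path.join(path, "*.dump"))
    return max(dumps, key=os.path.getmtime) if dumps else None


def main() -> int:
    dump = newest_backup(BACKUP_DIR)
    if dump is None:
        print("FAIL: no backups found at all")
        return 1
    age_hours = (time.time() - os.path.getmtime(dump)) / 3600
    size = os.path.getsize(dump)
    if age_hours > MAX_AGE_HOURS:
        print(f"FAIL: newest backup {dump} is {age_hours:.1f}h old")
        return 1
    if size < MIN_SIZE_BYTES:
        print(f"FAIL: newest backup {dump} is only {size} bytes")
        return 1
    print(f"OK: {dump} ({size} bytes, {age_hours:.1f}h old)")
    return 0


if __name__ == "__main__":
    sys.exit(main())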
devops  backups  cloud  outage  incidents  postmortem  gitlab 
february 2017 by jm
PagerDuty Incident Response Documentation
This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process).


This is a really good set of processes -- quite similar to what we used at Amazon for high-severity outage response.
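
The core of any such process is a severity ladder agreed before the incident, with a concrete response attached to each level so nobody has to debate it mid-outage. An illustrative sketch only -- these levels and actions are made up for the example, not taken from the PagerDuty docs:

#!/usr/bin/env python3
"""Illustrative severity ladder -- invented levels and responses, for illustration only."""
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1   # critical customer-facing outage
    SEV2 = 2   # major degradation
    SEV3 = 3   # minor or single-customer impact
    SEV4 = 4   # no customer impact
    SEV5 = 5   # informational


RESPONSE = {
    Severity.SEV1: "page all responders, assign an incident commander, update the status page",
    Severity.SEV2: "page primary on-call, open an incident channel",
    Severity.SEV3: "file a ticket for the owning team",
    Severity.SEV4: "add to the backlog",
    Severity.SEV5: "log it and move on",
}

if __name__ == "__main__":
    sev = Severity.SEV2
    print(f"{sev.name}: {RESPONSE[sev]}")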
ops  process  outages  pagerduty  incident-response  incidents  on-call 
january 2017 by jm
Outages, PostMortems, and Human Error 101
Good basic pres from John Allspaw, covering the basics of tier-one tech incident response -- defining the 5 severity levels; root cause analysis techniques (to Five-Whys or not); and the importance of service metrics
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
GMail partial outage - Dec 10 2012 incident report [PDF]
TL;DR: a bad load balancer change was deployed globally, causing the impact. 21-minute time to detection. Single-location rollout is now on the cards.
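
"Single-location rollout" amounts to staging the deployment: push the change to one location, let it bake against an error-rate guardrail, and only then continue to the rest. An illustrative sketch (location names, threshold and bake time are invented; nothing here reflects Google's actual tooling):

#!/usr/bin/env python3
"""Staged rollout loop -- an illustrative sketch only."""
import time

LOCATIONS = ["us-east", "eu-west", "asia-east"]   # hypothetical location names
ERROR_RATE_THRESHOLD = 0.01                       # 1% -- assumed guardrail
BAKE_TIME_SECONDS = 5                             # shortened for the sketch; real bake times are much longer


def deploy(location: str) -> None:
    # placeholder for the real deployment call
    print(f"deploying load balancer config to {location}")


def error_rate(location: str) -> float:
    # placeholder: in reality this would query the monitoring system
    return 0.001


def staged_rollout() -> None:
    for location in LOCATIONS:
        deploy(location)
        time.sleep(BAKE_TIME_SECONDS)   # let the change bake before judging it
        if error_rate(location) > ERROR_RATE_THRESHOLD:
            print(f"error rate too high in {location}; halting rollout, roll back")
            return
    print("rollout complete in all locations")


if __name__ == "__main__":
    staged_rollout()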
gmail  google  coe  incidents  postmortems  outages 
december 2012 by jm
