jm + incident-response (7)

The Time Our Provider Screwed Us
Good talk (with transcript) from Paul Biggar about what happened when CircleCI had a massive security incident, and how Jesse Robbins helped them do incident response correctly.

'On the left, Jesse pointed out that we needed an incident commander. That’s me, Paul. And this is very good, because I was a big proponent, as I think lots of us were around the 2013 mark, of flat organizational structures, and so I hadn’t really got a handle on this whole being in charge thing. The fact that someone else came in and said, “No, no, no, you are in charge”: extremely useful. And he also laid out the order of our priorities. Number one priority: safety of customers. Number two priority: communicate with customers. Number three priority: recovery of service.

I think a reasonable person could have put those in a different order, especially under the pressure and time constraints of a potential company-ending situation. So I was very happy to have those in order. If this is ever going to happen to you, I’d memorize them, maybe put them on an index card in your pocket, in case this ever happens.

The last thing he said is to make sure that we log everything, that we go slow, and that we code review and communicate. His point there is that if we’re going to bring our site back up, if we’re going to do all the things that we need to do in order to save our business and do the right thing for our customers and all that, we can’t be making quick, bad decisions. You can’t just upload whatever code is on your computer now, because I have to do this now, I have to fix it. So we set up a Slack channel … This was pre-Slack; it was a HipChat channel, where all of our communications went. Every single communication that we had about this went in that chatroom. Which came in extremely useful the next day, when I had to write a blog post that detailed exactly what had happened and all the steps that we did to fix it and remediate this, and I had exact timestamps of all the things that had happened.'
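A minimal sketch, in Python, of that "log everything, with timestamps" practice (not from the talk; the function name, filename, and example note are hypothetical): append every incident note to a durable log as it happens, so the exact postmortem timeline can be reconstructed later.

from datetime import datetime, timezone

def log_incident_note(note, logfile="incident-timeline.log"):
    # Append a UTC-timestamped note to the incident log; the resulting
    # file becomes the raw timeline for the postmortem write-up.
    stamp = datetime.now(timezone.utc).isoformat()
    with open(logfile, "a") as f:
        f.write(f"{stamp}  {note}\n")

log_incident_note("Revoked exposed customer tokens; replacement deploy is in code review")

In the story above, the chatroom itself played this role; the point is simply that every action ends up with a timestamped record somewhere durable.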
incidents  incident-response  paul-biggar  circleci  security  communication  outages 
26 days ago by jm
Julia Evans on Twitter: "notes on this great "When the pager goes off" article"
'notes on this great "When the pager goes off" article from @incrementmag https://increment.com/on-call/when-the-pager-goes-off/ ' -- a cartoon summarising a much longer article on common modern ops on-call response techniques. Still pretty consistent with the systems we used in Amazon.
on-call  ops  incident-response  julia-evans  pager  increment-mag 
april 2017 by jm
PagerDuty Incident Response Documentation
This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process).


This is a really good set of processes -- quite similar to what we used in Amazon for high-severity outage response.
ops  process  outages  pagerduty  incident-response  incidents  on-call 
january 2017 by jm
Outages, PostMortems, and Human Error 101
Good basic presentation from John Allspaw, covering the basics of tier-one tech incident response -- defining the 5 severity levels; root-cause analysis techniques (to Five-Whys or not); and the importance of service metrics.
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
Stephanie Dean on event management and incident response
I asked around among my ex-Amazon mates on Twitter about good docs on incident response practices outside the "iron curtain", and they pointed me at this blog (which I didn't realise existed).

Stephanie Dean was the front-line ops manager for Amazon for many years, over the period when they basically *fixed* their availability problems. She has since moved on to Facebook, Demonware, and Twitter. She really knows her stuff and this blog is FULL of great details of how they ran (and still run) front-line ops teams in Amazon.
ops  incident-response  outages  event-management  amazon  stephanie-dean  techops  tos  sev1 
october 2014 by jm
Adrian Cockroft's Cloud Outage Reports Collection
The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. [....] I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them.
outages  post-mortems  documentation  ops  aws  ec2  amazon  google  dropbox  microsoft  azure  incident-response 
march 2014 by jm
How to lose $172,222 a second for 45 minutes
Major outage and $465m of trading losses, caused by staggeringly inept software management: 8 years of incremental bitrot, technical debt, and a failure to have correct processes to engage an ops team in incident response. Hopefully this will serve as a lesson, at least to one industry, that software is more than just coding.
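(For reference, the headline numbers are consistent: $172,222 per second × 60 seconds per minute × 45 minutes ≈ $465m.)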
trading  programming  coding  software  inept  fail  bitrot  tech-debt  ops  incident-response 
october 2013 by jm
