
The Time Our Provider Screwed Us
Good talk (with transcript) from Paul Biggar about what happened when CircleCI had a massive security incident, and how Jesse Robbins helped them do incident response correctly.

'On the left, Jesse pointed out that we needed an incident commander. That’s me, Paul. And this is very good, because I was a big proponent, I think lots of us were around the 2013 mark, of flat organizational structures, and so I hadn’t really got a handle on this whole being in charge thing. The fact that someone else came in and said, “No, no, no, you are in charge”: extremely useful. And he also laid out the order of our priorities. Number one priority: safety of customers. Number two priority: communicate with customers. Number three priority: recovery of service.

I think a reasonable person could have put those in a different order, especially under the pressure and time constraints of the potential company-ending situation. So I was very happy to have those in order. If this is ever going to happen to you, I’d memorize them, maybe put it on an index card in your pocket, in case this ever happens.

The last thing he said is to make sure that we log everything, that we go slow, and that we code review and communicate. His point there is that if we’re going to bring our site back up, if we’re going to do all the things that we need to do in order to save our business and do the right thing for our customers and all that, we can’t be making quick, bad decisions. You can’t just upload whatever code is on your computer now, because I have to do this now, I have to fix it. So we set up a Slack channel … This was pre-Slack; it was a HipChat channel, where all of our communications went. Every single communication that we had about this went in that chatroom. Which came in extremely useful the next day, when I had to write a blog post that detailed exactly what had happened and all the steps that we did to fix it and remediate this, and I had exact timestamps of all the things that had happened.'
incidents  incident-response  paul-biggar  circleci  security  communication  outages 
21 days ago by jm
