jm + sre   10

The Yelp Production Engineering Documentation Style Guide
This is great! Also they correctly use the term "runbook" instead of "playbook" :)
Documentation is something that many of us in software and site reliability engineering struggle with – even if we recognize its importance, it can still be a struggle to write it consistently and to write it well. While we in Yelp’s Production Engineering group are no different, over the last few quarters we’ve engaged in a concerted effort to do something about it.

One of the first steps towards changing this process was developing our documentation style guide, something that started out as a Hackathon project late last year. I spoke about it when I was giving my talk on documentation at SRECon EMEA in August, and afterwards, a number of people reached out to ask if they could have a copy.

While what we’re sharing today isn’t our exact style guide – we’ve trimmed out some of the specifics that aren’t really relevant, done a bit of rewording for a more general audience, and added some annotations – it’s essentially the one we’ve been using since the start of this year, with the caveat that it’s a living document and continues to be refined. While this may not be perfect for every team (both at Yelp and elsewhere), it’s helped us raise the bar on our own documentation and provides an example for others to follow.
yelp  pe  sre  ops  engineering  documentation  srecon  chastity-blackwell  processes 
22 days ago by jm
Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
Solid article proselytising runbooks/playbooks (or in this article's parlance, "Incident Models") for dev/ops handover and operational knowledge
ops  process  sre  devops  runbooks  playbooks  incident-models 
april 2017 by jm
Annotated tenets of SRE
A google SRE annotates the Google SRE book with his own thoughts. The source material is great, but the commentary improves it alright.

Particularly good for the error budget concept.

Also: when did "runbooks" become "playbooks"? Don't particularly care either way, but needless renaming is annoying.
runbooks  playbooks  ops  google  sre  error-budget 
march 2017 by jm
Google - Site Reliability Engineering
The Google SRE book is now online, for free
sre  google  ops  books  reading 
january 2017 by jm
'Jupiter rising: A decade of Clos topologies and centralized control in Google’s datacenter networks'
Love the 'decade of' dig at FB and Amazon -- 'we were doing it first' ;)

Great details on how Google have built out and improved their DC networking. Includes a hint that they now use DCTCP (datacenter-optimized TCP congestion control) on their internal hosts....
datacenter  google  presentation  networks  networking  via:irldexter  ops  sre  clos-networks  fabrics  switching  history  datacenters 
october 2016 by jm
My Philosophy on Alerting
'based my observations while I was a Site Reliability Engineer at Google', courtesy of Rob Ewaschuk <rob@infinitepigeons.org>. Seem pretty reasonable
monitoring  sysadmin  alerting  alerts  nagios  pager  ops  sre  rob-ewaschuk 
july 2016 by jm
Review: Site Reliability Engineering
John "lusis" Vincent reviews the SRE book, not 100% positively
sre  books  reading  reviews  lusis 
april 2016 by jm
Dan Luu reviews the Site Reliability Engineering book
voluminous! still looks great, looking forward to reading our copy (via Tony Finch)
via:fanf  books  reading  devops  ops  google  sre  dan-luu 
april 2016 by jm
Wired on the new O'Reilly SRE book
"Site Reliability Engineering: How Google Runs Production Systems", by Chris Jones, Betsy Beyer, Niall Richard Murphy, Jennifer Petoff. Go Niall!
google  sre  niall-murphy  ops  devops  oreilly  books  toread  reviews 
april 2016 by jm
interview with Google VP of SRE Ben Treynor
interviewed by Niall Murphy, no less ;). Some good info on what Google deems important from an ops/SRE perspective
sre  ops  devops  google  monitoring  interviews  ben-treynor 
may 2014 by jm

Copy this bookmark:



description:


tags: