jm + root-cause   4

Outages, PostMortems, and Human Error 101
Good introductory presentation from John Allspaw, covering the basics of tier-one tech incident response: defining the 5 severity levels; root cause analysis techniques (to Five-Whys or not); and the importance of service metrics
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
Paper: "Root Cause Detection in a Service-Oriented Architecture" [pdf]
LinkedIn have implemented an automated root-cause detection system:

This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, shows a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.
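
To make the idea of "metrics plus call graph in, ranked list out" concrete, here is a deliberately simplified sketch. It is not the paper's algorithm: it just restricts the candidates to sensors reachable from the anomalous one in the call graph, then orders them by how strongly each metric series correlates with the anomalous series. The sensor names, the sample numbers, and the helper names (`pearson`, `reachable`, `rank_candidates`) are all made up for illustration.

```python
from collections import deque
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def reachable(call_graph, start):
    """Every sensor reachable from `start` by following caller -> callee edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for callee in call_graph.get(node, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    seen.discard(start)
    return seen


def rank_candidates(call_graph, metrics, anomalous):
    """Rank downstream sensors by |correlation| with the anomalous sensor's metric."""
    target = metrics[anomalous]
    scores = {
        sensor: abs(pearson(metrics[sensor], target))
        for sensor in reachable(call_graph, anomalous)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical call graph: the frontend calls profile and search, both call db.
call_graph = {
    "frontend": ["profile", "search"],
    "profile": ["db"],
    "search": ["db"],
}

# Hypothetical latency samples; "unrelated" is not called by the frontend at all.
metrics = {
    "frontend": [10, 11, 10, 30, 32],
    "profile": [5, 5, 6, 20, 21],
    "search": [7, 7, 7, 7, 7],
    "db": [3, 4, 3, 4, 3],
    "unrelated": [10, 11, 10, 30, 32],
}

ranking = rank_candidates(call_graph, metrics, "frontend")
for sensor, score in ranking:
    print(f"{sensor}: {score:.3f}")
```

Note that "unrelated" never appears in the output even though its series matches the frontend's exactly, because the call graph rules it out. That is the intuition behind feeding the call graph in alongside the metrics, rather than correlating everything against everything.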


This is a topic close to my heart after working on something similar for 3 years in Amazon!

Looks interesting, although (a) I would have liked to see more case studies and examples of "real world" outages it helped with; and (b) it's very much a machine-learning paper rather than a systems one, and there is no discussion of fault tolerance in the design of the detection system. That leaves me worried that in a large-scale outage event, the system itself will disappear just when its help is most vital. (This was a major design influence on our team's work.)

Overall, particularly given those 2 issues, I suspect it's not in production yet. Ours certainly was ;)
linkedin  soa  root-cause  alarming  correlation  service-metrics  machine-learning  graphs  monitoring 
june 2013 by jm
First 5 Minutes Troubleshooting A Server
Quite a good checklist of first steps for troubleshooting. Worth bookmarking for "dstat --top-io --top-bio" alone, which is an absolutely excellent tool and new to me.
dstat  server  io  disks  hardware  performance  linux  sysadmin  ops  troubleshooting  checklists  root-cause 
march 2013 by jm
