My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google', courtesy of Rob Ewaschuk. Seems pretty reasonable
monitoring  sysadmin  alerting  alerts  nagios  pager  ops  sre  rob-ewaschuk 
july 2016 by jm
Applying cardiac alarm management techniques to your on-call
An ops-focused take on a recent story about alarm fatigue, and how a Boston hospital dealt with it. When I was at Amazon, many of the teams in our division had a target to reduce false-positive pages, with a definite monetary value attached to it, since many teams had "time off in lieu" payments for out-of-hours pages to the on-call staff. As a result, reducing false-positive pages was a reasonably high priority, and we dealt with this problem very proactively, with a well-developed sense of how to do so. It's interesting to see that the outside world is only just starting to look into fixing it. (Another benefit of a TOIL policy ;)
ops  monitoring  sysadmin  alerts  alarms  nagios  alarm-fatigue  false-positives  pages 
september 2014 by jm
10 Things We Forgot to Monitor
a list of not-so-common outage causes which are easy to overlook; swap rate, NTP drift, SSL expiration, fork rate, etc.
nagios  metrics  ops  monitoring  systems  ntp  bitly 
january 2014 by jm
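As a rough illustration of one item on that list (SSL expiration), here is a minimal sketch of an expiry check using only the Python standard library. The function names and the 30-day and 7-day thresholds are my own assumptions, not taken from the linked article; the return values follow the usual Nagios plugin exit-code convention.

```python
import socket
import ssl
from datetime import datetime, timezone

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the peer certificate's 'notAfter' string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the certificate's expiry time."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400.0


def expiry_status(days_left: float, warn_days: float = 30, crit_days: float = 7) -> int:
    """Map remaining days to an exit code (thresholds are illustrative)."""
    if days_left <= crit_days:
        return CRITICAL
    if days_left <= warn_days:
        return WARNING
    return OK
```

In a real check you would call `fetch_not_after("example.com")`, pass the result and `datetime.now(timezone.utc)` to `days_until_expiry`, and exit with the code from `expiry_status`.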
The How and Why of Flapjack
Flapjack aims to be a flexible notification system that handles:

Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc);
Alert summarisation (with per-user, per media summary thresholds);
Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc).

Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.
flapjack  notification  alerts  ops  nagios  paging  sensu 
january 2014 by jm
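To make the routing idea above concrete, here is a toy sketch of "who should know about the problem". This is not Flapjack's API or data model; `Contact`, `Maintenance` and `route` are hypothetical names, and the rules (interest match, time-of-day window, scheduled maintenance suppression) are a simplified reading of the description.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Contact:
    name: str
    interests: set                      # entity names this contact cares about
    on_from: time = time(0, 0)          # start of the contact's notification window
    on_until: time = time(23, 59, 59)   # end of the contact's notification window


@dataclass
class Maintenance:
    entity: str
    start: datetime
    end: datetime


def in_maintenance(entity: str, now: datetime, windows: list) -> bool:
    """True if `entity` has a scheduled maintenance window covering `now`."""
    return any(w.entity == entity and w.start <= now < w.end for w in windows)


def route(entity: str, state: str, now: datetime, contacts: list, windows: list) -> list:
    """Return the names of contacts who should be notified about this event."""
    if state == "ok" or in_maintenance(entity, now, windows):
        return []
    return [
        c.name
        for c in contacts
        if entity in c.interests and c.on_from <= now.time() <= c.on_until
    ]
```

For example, an out-of-hours critical event on an entity two people follow would go only to the contact whose window covers that time, and to nobody while the entity is in scheduled maintenance.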
'a Nagios plugin to poll Graphite'. Necessary, since service metrics are the true source of service health information
nagios  graphite  service-metrics  ops 
january 2013 by jm
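For a sense of what such a check involves, here is a minimal sketch that polls Graphite's render API (`format=json`) and maps the recent average to a Nagios exit code. It is not the linked plugin's code; the function names, the 5-minute window and the averaging rule are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_series(base_url: str, target: str, window: str = "-5min") -> list:
    """Fetch recent datapoints for `target` from Graphite's /render endpoint as JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    with urlopen(f"{base_url}/render?{query}", timeout=10) as resp:
        return json.load(resp)


def evaluate(series: list, warn: float, crit: float) -> tuple:
    """Average all non-null datapoints and compare against the thresholds."""
    values = [v for s in series for v, _ts in s["datapoints"] if v is not None]
    if not values:
        # No data at all is reported as UNKNOWN rather than OK.
        return UNKNOWN, "UNKNOWN: no datapoints returned"
    avg = sum(values) / len(values)
    if avg >= crit:
        return CRITICAL, f"CRITICAL: average {avg:.2f} >= {crit}"
    if avg >= warn:
        return WARNING, f"WARNING: average {avg:.2f} >= {warn}"
    return OK, f"OK: average {avg:.2f}"
```

A plugin wrapper would print the message and call `sys.exit()` with the returned code; treating "no data" as UNKNOWN means a silent metrics pipeline does not look healthy.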

