jm + nagios   6

'Monitoring Cloudflare's planet-scale edge network with Prometheus' (preso)
from SREcon EMEA 2017; how Cloudflare are replacing Nagios with Prometheus and Grafana
metrics  monitoring  alerting  prometheus  grafana  nagios 
22 days ago by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google', courtesy of Rob Ewaschuk <rob@infinitepigeons.org>. Seems pretty reasonable
monitoring  sysadmin  alerting  alerts  nagios  pager  ops  sre  rob-ewaschuk 
july 2016 by jm
Applying cardiac alarm management techniques to your on-call
An ops-focused take on a recent story about alarm fatigue, and how a Boston hospital dealt with it. When I was at Amazon, many of the teams in our division had a target to reduce false-positive pages, with a definite monetary value attached to it, since many teams had "time off in lieu" payments for out-of-hours pages to the on-call staff. As a result, reducing false-positive pages was a reasonably high priority and we dealt with this problem very proactively, with a well-developed sense of how to do so. It's interesting to see how the outside world is only just starting to look into ameliorating alarm fatigue. (Another benefit of a TOIL policy ;)
ops  monitoring  sysadmin  alerts  alarms  nagios  alarm-fatigue  false-positives  pages 
september 2014 by jm
10 Things We Forgot to Monitor
a list of not-so-common outage causes which are easy to overlook: swap rate, NTP drift, SSL expiration, fork rate, etc.
nagios  metrics  ops  monitoring  systems  ntp  bitly 
january 2014 by jm
The How and Why of Flapjack
Flapjack aims to be a flexible notification system that handles:

Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc);
Alert summarisation (with per-user, per media summary thresholds);
Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc).

Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.
flapjack  notification  alerts  ops  nagios  paging  sensu 
january 2014 by jm
check_graphite
'a Nagios plugin to poll Graphite'. Necessary, since service metrics are the true source of service health information
nagios  graphite  service-metrics  ops 
january 2013 by jm
