jm + alerting   10

'Monitoring Cloudflare's planet-scale edge network with Prometheus' (preso)
from SRECon EMEA 2017; how Cloudflare are replacing Nagios with Prometheus and Grafana
metrics  monitoring  alerting  prometheus  grafana  nagios 
11 weeks ago by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google', courtesy of Rob Ewaschuk <rob@infinitepigeons.org>. Seems pretty reasonable
monitoring  sysadmin  alerting  alerts  nagios  pager  ops  sre  rob-ewaschuk 
july 2016 by jm
AWS Api Gateway for Fun and Profit
good worked-through example of an API Gateway rewriting system
api-gateway  aws  api  http  services  ops  alerting  alarming  opsgenie  signalfx 
december 2015 by jm
Alarm design: From nuclear power to WebOps
Imagine you are an operator in a nuclear power control room. An accident has started to unfold. During the first few minutes, more than 100 alarms go off, and there is no system for suppressing the unimportant signals so that you can concentrate on the significant alarms. Information is not presented clearly; for example, although the pressure and temperature within the reactor coolant system are shown, there is no direct indication that the combination of pressure and temperature mean that the cooling water is turning into steam. There are over 50 alarms lit in the control room, and the computer printer registering alarms is running more than 2 hours behind the events.

This was the basic scenario facing the control room operators during the Three Mile Island (TMI) partial nuclear meltdown in 1979. The Report of the President’s Commission stated that, “Overall, little attention had been paid to the interaction between human beings and machines under the rapidly changing and confusing circumstances of an accident” (p. 11). The TMI control room operator on the day, Craig Faust, recalled for the Commission his reaction to the incessant alarms: “I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information”. It was the first major illustration of the alarm problem, and the accident triggered a flurry of human factors/ergonomics (HF/E) activity.


A familiar topic for this ex-member of the Amazon network monitoring team...
ergonomics  human-factors  ui  ux  alarms  alerts  alerting  three-mile-island  nuclear-power  safety  outages  ops 
november 2015 by jm
Backstage Blog - Prometheus: Monitoring at SoundCloud - SoundCloud Developers
whoa, this is pretty excellent. The major improvement over a graphite-based system would be the multi-dimensional tagging of metrics, which we currently have to do by simply expanding the graphite metric's name to encompass all those dimensions and use searching at query time, inefficiently.
monitoring  soundcloud  prometheus  metrics  service-metrics  graphite  alerting 
february 2015 by jm
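To make the Prometheus comment above concrete, here is a minimal sketch of the difference between packing dimensions into a dotted Graphite-style name and attaching them as labels. The metric names, label names, and helper functions are made up for illustration; this is not the Prometheus or Graphite API, just plain Python data structures.

```python
from collections import defaultdict

# Graphite-style: every dimension is flattened into the dotted metric name.
# (Hypothetical names; the position of each field is a convention, not enforced.)
graphite_samples = {
    "api.eu-west.host1.GET.requests": 120,
    "api.eu-west.host2.GET.requests": 80,
    "api.us-east.host1.POST.requests": 40,
}

# Label-style: one metric, with each dimension carried as an explicit key/value.
labelled_samples = [
    ({"region": "eu-west", "host": "host1", "method": "GET"}, 120),
    ({"region": "eu-west", "host": "host2", "method": "GET"}, 80),
    ({"region": "us-east", "host": "host1", "method": "POST"}, 40),
]


def graphite_sum_by_region(samples):
    """Aggregate by region by parsing the name; breaks if the layout changes."""
    totals = defaultdict(int)
    for name, value in samples.items():
        region = name.split(".")[1]  # relies on the field's position in the name
        totals[region] += value
    return dict(totals)


def sum_by(samples, label):
    """Aggregate by any label without knowing how the name was laid out."""
    totals = defaultdict(int)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


print(graphite_sum_by_region(graphite_samples))
print(sum_by(labelled_samples, "region"))
print(sum_by(labelled_samples, "method"))
```

With labels, grouping by a different dimension is a change of argument rather than a new name-parsing rule.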
Logentries Announces Machine Learning Analytics for IT Ops Monitoring and Real-time Alerting
This sounds pretty neat:
With Logentries Anomaly Detection, users can:

Set-up real-time alerting based on deviations from important patterns and log events.
Easily customize Anomaly thresholds and compare different time periods.

With Logentries Inactivity Alerting, users can:

Monitor standard, incoming events such as an application heart beat.
Receive real-time alerts based on log inactivity (i.e. receive alerts when something does not occur).
logging  syslog  logentries  anomaly-detection  ops  machine-learning  inactivity  alarms  alerting  heartbeats 
august 2014 by jm
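The inactivity-alerting idea in the Logentries entry above (alert when something does not occur) can be sketched roughly as follows. This is an illustration of the general heartbeat technique, not Logentries' implementation; the function name, source names, and the 60-second window are assumptions.

```python
def inactive_sources(last_seen, now, max_silence_s):
    """Return the sources whose most recent event is older than max_silence_s.

    last_seen maps a source name to the timestamp (in seconds) of its last
    heartbeat or log event; now is the current time on the same clock.
    """
    return sorted(
        name for name, ts in last_seen.items() if now - ts > max_silence_s
    )


# Hypothetical example: app-a last logged 100s ago, app-b 40s ago.
last_seen = {"app-a": 100.0, "app-b": 160.0}
print(inactive_sources(last_seen, now=200.0, max_silence_s=60))
```

In a real system the timestamps would be updated as events arrive and the check would run on a timer.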
Metrics-Driven Development
we believe MDD is equal parts engineering technique and cultural process. It separates the notion of monitoring from its traditional position of exclusivity as an operations thing and places it more appropriately next to its peers as an engineering process. Provided access to real-time production metrics relevant to them individually, both software engineers and operations engineers can validate hypotheses, assess problems, implement solutions, and improve future designs.


Broken down into the following principles: 'Instrumentation-as-Code', 'Single Source of Truth', 'Developers Curate Visualizations and Alerts', 'Alert on What You See', 'Show me the Graph', 'Don’t Measure Everything (YAGNI)'.

We do all of these at Swrve, naturally (a technique I happily stole from Amazon).
metrics  coding  graphite  mdd  instrumentation  yagni  alerting  monitoring  graphs 
july 2014 by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google.' - by Rob Ewaschuk; very good, and matching the similar recommendations and best practices at Amazon for that matter
monitoring  ops  devops  alerting  alerts  pager-duty  via:jk 
may 2013 by jm
The first pillar of agile sysadmin: We alert on what we draw
'One of [the] purposes of monitoring systems was to provide data to allow us, as engineers, to detect patterns, and predict issues before they become production impacting. In order to do this, we need to be capturing data and storing it somewhere which allows us to analyse it. If we care about it - if the data could provide the kind of engineering insight which helps us to understand our systems and give early warning - we should be capturing it. ' .... 'There are a couple of weaknesses in [Nagios' design]. Assuming we’ve agreed that if we care about a metric enough to want to alert on it then we should be gathering that data for analysis, and graphing it, then we already have the data upon which to base our check. Furthermore, this data is not on the machine we’re monitoring, so our checks don’t in any way add further stress to that machine.' I would add that if we are alerting on a different set of data from what we collect for graphing, then using the graphs to investigate an alarm may run into problems if they don't sync up.
devops  monitoring  deployment  production  sysadmin  ops  alerting  metrics 
march 2013 by jm
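A rough sketch of "we alert on what we draw", as described in the entry above: the alert check reads the same stored series the graph is drawn from, rather than running a separate probe on the monitored machine. The function name, threshold, and sample data are assumptions for illustration only.

```python
def should_alert(series, threshold, min_points):
    """Fire only if the last min_points stored values all exceed threshold.

    series is a list of (timestamp, value) pairs already collected for
    graphing, so the alert and the graph cannot disagree about the data.
    """
    recent = series[-min_points:]
    return len(recent) == min_points and all(v > threshold for _, v in recent)


# Hypothetical CPU-usage series sampled once a minute.
series = [(0, 10), (60, 95), (120, 97), (180, 99)]
print(should_alert(series, threshold=90, min_points=3))
```

Requiring several consecutive points is one simple way to avoid paging on a single spike; the right window depends on the metric.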
