jm + alerts   9

Actual screenshot of the broken UX of the Hawaii ballistic missile alert system
"This is the screen that set off the ballistic missile alert on Saturday. The operator clicked the PACOM (CDW) State Only link. The drill link is the one that was supposed to be clicked."


This is terrible, terrible UX.
ux  ui  hawaii  alerting  alerts  testing  safety  fail 
january 2018 by jm
The likely user interface which led to Hawaii's false-alarm incoming-ballistic-missile alert on Saturday 2018-01-13
@supersat on Twitter:

"In case you're curious what Hawaii's EAS/WEA interface looks like, I believe it's similar to this. Hypothesis: they test their EAS authorization codes at the beginning of each shift and selected the wrong option."

This is absolutely classic enterprisey, government-standard web UX -- a dropdown template selection and an easily-misclicked pair of tickboxes to choose test or live mode.
testing  ux  user-interfaces  fail  eas  hawaii  false-alarms  alerts  nuclear  early-warning  human-error 
january 2018 by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google', courtesy of Rob Ewaschuk <rob@infinitepigeons.org>. Seems pretty reasonable.
monitoring  sysadmin  alerting  alerts  nagios  pager  ops  sre  rob-ewaschuk 
july 2016 by jm
Alarm design: From nuclear power to WebOps
Imagine you are an operator in a nuclear power control room. An accident has started to unfold. During the first few minutes, more than 100 alarms go off, and there is no system for suppressing the unimportant signals so that you can concentrate on the significant alarms. Information is not presented clearly; for example, although the pressure and temperature within the reactor coolant system are shown, there is no direct indication that the combination of pressure and temperature mean that the cooling water is turning into steam. There are over 50 alarms lit in the control room, and the computer printer registering alarms is running more than 2 hours behind the events.

This was the basic scenario facing the control room operators during the Three Mile Island (TMI) partial nuclear meltdown in 1979. The Report of the President’s Commission stated that, “Overall, little attention had been paid to the interaction between human beings and machines under the rapidly changing and confusing circumstances of an accident” (p. 11). The TMI control room operator on the day, Craig Faust, recalled for the Commission his reaction to the incessant alarms: “I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information”. It was the first major illustration of the alarm problem, and the accident triggered a flurry of human factors/ergonomics (HF/E) activity.


A familiar topic for this ex-member of the Amazon network monitoring team...
ergonomics  human-factors  ui  ux  alarms  alerts  alerting  three-mile-island  nuclear-power  safety  outages  ops 
november 2015 by jm
Should Airplanes Be Flying Themselves?
Excellent Vanity Fair article on the AF447 disaster, covering pilots' team-leadership skills, Clipper Skippers, Alternate Law, and autopilot design: 'There is an old truth in aviation that the reasons you get into trouble become the reasons you don’t get out of it.'

Also interesting:

'The best pilots discard the [autopilot] automation naturally when it becomes unhelpful, and again there appear to be some cultural traits involved. Simulator studies have shown that Irish pilots, for instance, will gleefully throw away their crutches, while Asian pilots will hang on tightly. It’s obvious that the Irish are right, but in the real world Sarter’s advice is hard to sell. The automation is simply too compelling. The operational benefits outweigh the costs. The trend is toward more of it, not less. And after throwing away their crutches, many pilots today would lack the wherewithal to walk.'

(via Gavin Sheridan)
airlines  automation  flight  flying  accidents  post-mortems  af447  air-france  autopilot  alerts  pilots  team-leaders  clipper-skippers  alternate-law 
november 2014 by jm
Applying cardiac alarm management techniques to your on-call
An ops-focused take on a recent story about alarm fatigue, and how a Boston hospital dealt with it. When I was at Amazon, many of the teams in our division had a target to reduce false-positive pages, with a definite monetary value attached to it, since many teams had "time off in lieu" payments for out-of-hours pages to the on-call staff. As a result, reducing false-positive pages was a reasonably high priority, and we dealt with the problem very proactively, with a well-developed sense of how to do so. It's interesting to see that the outside world is only just starting to look into how to fix it. (Another benefit of a TOIL policy ;)
ops  monitoring  sysadmin  alerts  alarms  nagios  alarm-fatigue  false-positives  pages 
september 2014 by jm
Dead Man's Snitch
a cron job monitoring tool that keeps an eye on your periodic processes and notifies you when something doesn't happen. Daily backups, monthly emails, or cron jobs you need to monitor? Dead Man's Snitch has you covered. Know immediately when one of these processes doesn't work.
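
The "notify when something doesn't happen" idea is a dead man's switch: each job checks in when it completes, and anything that has been silent for longer than its expected interval gets flagged. A minimal sketch of that logic follows. The class and method names are hypothetical and this is not Dead Man's Snitch's actual API.

```python
from datetime import datetime, timedelta


class DeadMansSwitch:
    """Tracks periodic jobs and reports those that missed a check-in.

    Hypothetical illustration only; not the Dead Man's Snitch API.
    """

    def __init__(self):
        # job name -> (expected interval, time of last check-in)
        self._jobs = {}

    def register(self, name, interval, now):
        """Start watching a job; registration counts as the first check-in."""
        self._jobs[name] = (interval, now)

    def check_in(self, name, now):
        """Record that the job ran successfully at `now`."""
        interval, _ = self._jobs[name]
        self._jobs[name] = (interval, now)

    def overdue(self, now):
        """Return the sorted names of jobs silent for longer than their interval."""
        return sorted(
            name
            for name, (interval, last_seen) in self._jobs.items()
            if now - last_seen > interval
        )
```

In practice the cron job would call something like `check_in` as its final step, and a separate scheduler would poll `overdue` and page someone for each name it returns.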


via Marc.
alerts  cron  monitoring  sysadmin  ops  backups  alarms 
april 2014 by jm
The How and Why of Flapjack
Flapjack aims to be a flexible notification system that handles:

Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc);
Alert summarisation (with per-user, per media summary thresholds);
Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc).

Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.
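
A rough sketch of the routing step described above: an event arrives from the check engine, and the router decides who should hear about it, skipping recoveries and entities in scheduled maintenance. All names here (`Contact`, `Router`, `route`) are assumptions made for illustration. This is not Flapjack's data model or API, and it leaves out summarisation and time-of-day rules.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    interests: set  # entity names this contact wants alerts for


@dataclass
class Router:
    contacts: list
    # entities currently in scheduled maintenance (alerts suppressed)
    maintenance: set = field(default_factory=set)

    def route(self, entity, state):
        """Return the names of contacts to notify for one check event."""
        # Healthy checks and entities under maintenance notify nobody.
        if state == "ok" or entity in self.maintenance:
            return []
        return [c.name for c in self.contacts if entity in c.interests]
```

Keeping this decision downstream of the check engine means the checks themselves only report state, and who gets told can change without touching them.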
flapjack  notification  alerts  ops  nagios  paging  sensu 
january 2014 by jm
My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google.' - by Rob Ewaschuk; very good, and it matches similar recommendations and best practices at Amazon, for that matter
monitoring  ops  devops  alerting  alerts  pager-duty  via:jk 
may 2013 by jm
