jm + autoremediation   3

Operating Apache Kafka Clusters 24/7 Without A Global Ops Team
Lyft built an autoremediation system and apparently it works :) Good to get a detailed writeup on such an elusive beast.
autoremediation  failures  ops  kafka  scalability  automation 
10 weeks ago by jm
'Monitoring and detecting causes of failures of network paths', US patent 8,661,295 (B1)
The first software patent in my name -- couldn't avoid it forever :(
Systems and methods are provided for monitoring and detecting causes of failures of network paths. The system collects performance information from a plurality of nodes and links in a network, aggregates the collected performance information across paths in the network, processes the aggregated performance information for detecting failures on the paths, analyzes each of the detected failures to determine at least one root cause, and initiates a remedial workflow for the at least one root cause determined. In some aspects, processing the aggregated information may include performing a statistical regression analysis or otherwise solving a set of equations for the performance indications on each of a plurality of paths. In another aspect, the system may also include an interface which makes available for display one or more of the network topology, the collected and aggregated performance information, and indications of the detected failures in the topology.

The patent describes an early version of Pimms, the network failure detection and remediation system we built for Amazon.
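
To make the "solving a set of equations for the performance indications on each of a plurality of paths" part of the abstract a bit more concrete, here is a minimal sketch of the general idea. It is not the Pimms implementation or the patented method. It assumes that path-level packet loss is measured end to end, that links drop packets independently (so log success rates add up along a path), and that every loss rate is below 1. The function names, link names and numbers are all made up for illustration.

```python
import math


def solve_least_squares(A, y):
    """Solve min ||Ax - y||^2 via the normal equations (A^T A) x = A^T y."""
    n = len(A[0])
    # Build the augmented matrix [A^T A | A^T y].
    M = [
        [sum(row[i] * row[j] for row in A) for j in range(n)]
        + [sum(row[i] * yi for row, yi in zip(A, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("links are not identifiable from these paths")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def estimate_link_loss(paths, links, path_loss):
    """Estimate per-link loss rates from end-to-end path loss rates."""
    # Routing matrix: A[p][l] is 1 if path p traverses link l, else 0.
    A = [[1.0 if link in path else 0.0 for link in links] for path in paths]
    # Under the independence assumption, -log(path success) is the sum of
    # -log(link success) over the links on the path, so the system is linear.
    y = [-math.log(1.0 - loss) for loss in path_loss]
    x = solve_least_squares(A, y)
    return {link: 1.0 - math.exp(-xi) for link, xi in zip(links, x)}


# Hypothetical topology: three links, four probed paths.
links = ["a", "b", "c"]
paths = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "b", "c")]
# Hypothetical measurements: every path that crosses link "b" loses 10%.
path_loss = [0.1, 0.1, 0.0, 0.1]

estimates = estimate_link_loss(paths, links, path_loss)
suspect = max(estimates, key=estimates.get)
print(suspect, estimates)
```

With these made-up measurements the loss is attributed to link "b", which would then be the candidate root cause handed to a remedial workflow. Real measurements are noisy and real topologies often leave some links unidentifiable, which is why the sketch raises an error instead of guessing in that case.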
amazon  pimms  swpats  patents  networking  ospf  autoremediation  outage-detection 
may 2014 by jm
Introducing Chaos to C*
Autoremediation, i.e. auto-replacement, of Cassandra nodes in production at Netflix
ops  autoremediation  outages  remediation  cassandra  storage  netflix  chaos-monkey 
october 2013 by jm
