jm + autoremediation 3
Operating Apache Kafka Clusters 24/7 Without A Global Ops Team
10 weeks ago by jm
Lyft built an autoremediation system, and apparently it works :) Good to get a detailed writeup on such an elusive beast.
autoremediation
failures
ops
kafka
scalability
automation
'Monitoring and detecting causes of failures of network paths', US patent 8,661,295 (B1)
may 2014 by jm
The first software patent in my name -- couldn't avoid it forever :(
The patent describes an early version of Pimms, the network failure detection and remediation system we built for Amazon.
amazon
pimms
swpats
patents
networking
ospf
autoremediation
outage-detection
Systems and methods are provided for monitoring and detecting causes of failures of network paths. The system collects performance information from a plurality of nodes and links in a network, aggregates the collected performance information across paths in the network, processes the aggregated performance information for detecting failures on the paths, analyzes each of the detected failures to determine at least one root cause, and initiates a remedial workflow for the at least one root cause determined. In some aspects, processing the aggregated information may include performing a statistical regression analysis or otherwise solving a set of equations for the performance indications on each of a plurality of paths. In another aspect, the system may also include an interface which makes available for display one or more of the network topology, the collected and aggregated performance information, and indications of the detected failures in the topology.
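As a rough illustration of the "solving a set of equations" step mentioned in the abstract, here is a minimal Python sketch. It is not the patented method or the Pimms implementation. It assumes that each path's loss rate is observed, that links fail independently (so success probabilities multiply along a path), and that there are enough distinct paths to identify every link. The topology, the names `estimate_link_loss` and `solve_linear`, and the loss figures are all hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("links are not identifiable from these paths")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate_link_loss(paths, path_loss):
    """Least-squares estimate of per-link loss from per-path loss.

    paths:     dict of path name -> list of link names (hypothetical topology)
    path_loss: dict of path name -> observed loss fraction in [0, 1)
    """
    links = sorted({link for hops in paths.values() for link in hops})
    idx = {link: i for i, link in enumerate(links)}
    names = sorted(paths)
    # One row per path, with a 1 for every link the path traverses.
    a = [[0.0] * len(links) for _ in names]
    for r, name in enumerate(names):
        for link in paths[name]:
            a[r][idx[link]] = 1.0
    # Assumption: independent links, so -log(1 - loss) adds up along a path.
    b = [-math.log(1.0 - path_loss[name]) for name in names]
    # Normal equations (A^T A) x = A^T b give the least-squares solution.
    n, rows = len(links), range(len(names))
    ata = [[sum(a[r][i] * a[r][j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(a[r][i] * b[r] for r in rows) for i in range(n)]
    x = solve_linear(ata, atb)
    return {link: 1.0 - math.exp(-x[idx[link]]) for link in links}


# Hypothetical triangle: the two paths that share link B both see 10% loss.
paths = {"p1": ["A", "B"], "p2": ["B", "C"], "p3": ["A", "C"]}
observed = {"p1": 0.1, "p2": 0.1, "p3": 0.0}
estimates = estimate_link_loss(paths, observed)
suspect = max(estimates, key=estimates.get)
print(suspect, estimates)
```

In this toy case the estimate pins the loss on link B, which is the kind of root-cause localisation the abstract describes. With real data, a library least-squares routine such as `numpy.linalg.lstsq` would be a more robust choice than the hand-rolled solver.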
Introducing Chaos to C*
october 2013 by jm
Autoremediation, i.e. auto-replacement, of Cassandra nodes in production at Netflix
ops
autoremediation
outages
remediation
cassandra
storage
netflix
chaos-monkey