jm + pimms   2

2015-02-19 GCE outage
40 minutes of multi-zone network outage for majority of instances.

'The internal software system which programs GCE’s virtual network for VM
egress traffic stopped issuing updated routing information. The cause of
this interruption is still under active investigation. Cached route
information provided a defense in depth against missing updates, but GCE VM
egress traffic started to be dropped as the cached routes expired.'

I wonder if Google Pimms fired the alarms for this ;)
google  outages  gce  networking  routing  pimms  multi-az  cloud 
february 2015 by jm
'Monitoring and detecting causes of failures of network paths', US patent 8,661,295 (B1)
The first software patent in my name -- couldn't avoid it forever :(
Systems and methods are provided for monitoring and detecting causes of failures of network paths. The system collects performance information from a plurality of nodes and links in a network, aggregates the collected performance information across paths in the network, processes the aggregated performance information for detecting failures on the paths, analyzes each of the detected failures to determine at least one root cause, and initiates a remedial workflow for the at least one root cause determined. In some aspects, processing the aggregated information may include performing a statistical regression analysis or otherwise solving a set of equations for the performance indications on each of a plurality of paths. In another aspect, the system may also include an interface which makes available for display one or more of the network topology, the collected and aggregated performance information, and indications of the detected failures in the topology.

The patent describes an early version of Pimms, the network failure detection and remediation system we built for Amazon.
amazon  pimms  swpats  patents  networking  ospf  autoremediation  outage-detection 
may 2014 by jm

Copy this bookmark: