jm + resilience   7

'STELLA Report from the SNAFUcatchers Workshop on Coping With Complexity', March 14-16 2017
'A consortium workshop of high end techs reviewed postmortems to better understand how engineers cope with the complexity of anomalies (SNAFU and SNAFU catching episodes) and how to support them. These cases reveal common themes regarding factors that produce resilient performances. The themes that emerge also highlight opportunities to move forward.'

The 'Dark debt' concept is interesting here.
complexity  postmortems  dark-debt  technical-debt  resilience  reliability  systems  snafu  reports  toread  stella  john-allspaw 
4 days ago by jm
Should create a separate Hystrix Thread pool for each remote call?
Excellent advice on capacity planning and queueing theory, in the context of Hystrix. Should I use a single thread pool for all dependency callouts, or independent thread pools for each one?
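For reference, a minimal sketch of the "one pool per dependency" option. It sizes each pool with Little's Law (in-flight calls = arrival rate x latency), a common queueing-theory rule of thumb and not necessarily what the linked answer recommends. The dependency names, traffic figures, `headroom` factor, and the `pool_size` and `call` helpers are all hypothetical, and this uses Python's standard library, not Hystrix itself.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def pool_size(peak_rps: float, latency_s: float, headroom: float = 1.5) -> int:
    """Estimate threads needed: arrival rate * time per call, times headroom."""
    return max(1, math.ceil(peak_rps * latency_s * headroom))


# Hypothetical dependencies: name -> (peak requests/sec, typical latency in seconds).
DEPENDENCIES = {
    "user-service": (32.0, 0.25),
    "search-service": (10.0, 0.5),
}

# One independent pool per dependency, so a slow dependency can only
# exhaust its own threads and not starve calls to the others.
POOLS = {
    name: ThreadPoolExecutor(
        max_workers=pool_size(rps, latency), thread_name_prefix=name
    )
    for name, (rps, latency) in DEPENDENCIES.items()
}


def call(dependency: str, fn, *args, timeout_s: float = 1.0):
    """Run fn on the named dependency's pool and wait at most timeout_s.

    Note: unlike a Hystrix thread pool, ThreadPoolExecutor queues excess
    work instead of rejecting it, so this sketch isolates but does not shed load.
    """
    future = POOLS[dependency].submit(fn, *args)
    return future.result(timeout=timeout_s)
```

A single shared pool is simpler to size, but one misbehaving dependency can then consume every thread; separate pools trade some idle capacity for that isolation.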
threadpools  pooling  hystrix  capacity  queue-theory  queueing  queues  failure  resilience  soa  microservices 
may 2016 by jm
RobustIRC
'IRC without netsplits' using Raft consensus
raft  irc  netsplits  resilience  fault-tolerance 
november 2015 by jm
muxy
a proxy that mucks with your system and application context, operating at Layers 4 and 7, allowing you to simulate common failure scenarios from the perspective of an application under test, such as an API or a web application. If you are building a distributed system, Muxy can help you test your resilience and fault tolerance patterns.
proxy  distributed  testing  web  http  fault-tolerance  failure  injection  tcp  delay  resilience  error-handling 
september 2015 by jm
(SEC307) Building a DDoS-Resilient Architecture with AWS
good slides on a "web application firewall" proxy service, deployable as an auto-scaling EC2 unit
ec2  aws  ddos  security  resilience  slides  reinvent  firewalls  http  elb 
april 2015 by jm
Can Spark Streaming survive Chaos Monkey?
good empirical results on Spark's resilience to network/host outages in EC2
ec2  aws  emr  spark  resilience  ha  fault-tolerance  chaos-monkey  netflix 
march 2015 by jm
Weathering the Unexpected - ACM Queue
Failures happen, and resilience drills help organizations prepare for them.

Good write-up on Google's DiRT (Disaster Recovery Test) procedures, clearly based on Amazon's Gameday exercises. ;) See also http://queue.acm.org/detail.cfm?id=2371297 for a moderated discussion including Jesse Robbins and John Allspaw.
game-day  tests  disaster-recovery  dirt  exercises  history  amazon  google  etsy  resilience  acm 
september 2012 by jm
