jm + billing   3

Spotting a million dollars in your AWS account · Segment Blog
You can easily split your spend by AWS service per month and call it a day. Ten thousand dollars of EC2, one thousand to S3, five hundred dollars to network traffic, etc. But what’s still missing is a synthesis of which products and engineering teams are dominating your costs. 

Then, add in the fact that you may have hundreds of instances and millions of containers that come and go. Soon, what started as simple analysis problem has quickly become unimaginably complex. 

In this follow-up post, we’d like to share details on the toolkit we used. Our hope is to offer up a few ideas to help you analyze your AWS spend, no matter whether you’re running only a handful of instances, or tens of thousands.

segment  money  costs  billing  aws  ec2  ecs  ops 
8 days ago by jm
Twilio Billing Incident Post-Mortem
At 1:35 AM PDT on July 18, a loss of network connectivity caused all billing redis-slaves to simultaneously disconnect from the master. This caused all redis-slaves to reconnect and request full synchronization with the master at the same time. Receiving full sync requests from each redis-slave caused the master to suffer extreme load, resulting in performance degradation of the master and timeouts from redis-slaves to redis-master.
By 2:39 AM PDT the host’s load became so extreme, services relying on redis-master began to fail. At 2:42 AM PDT, our monitoring system alerted our on-call engineering team of a failure in the Redis cluster. Observing extreme load on the host, the redis process on redis-master was misdiagnosed as requiring a restart to recover. This caused redis-master to read an incorrect configuration file, which in turn caused Redis to attempt to recover from a non-existent AOF file, instead of the binary snapshot. As a result of that failed recovery, redis-master dropped all balance data. In addition to forcing recovery from a non-existent AOF, an incorrect configuration also caused redis-master to boot as a slave of itself, putting it in read-only mode and preventing the billing system from updating account balances.

See also for antirez' response.

Here's the takeaways I'm getting from it:

1. network partitions happen in production, and cause cascading failures. this is a great demo of that.

2. don't store critical data in Redis. this was the case for Twilio -- as far as I can tell they were using Redis as a front-line cache for billing data -- but it's worth saying anyway. ;)

3. Twilio were just using Redis as a cache, but a bug in their code meant that the writes to the backing SQL store were not being *read*, resulting in repeated billing and customer impact. In other words, it turned a (fragile) cache into the authoritative store.

4. they should probably have designed their code so that write failures would not result in repeated billing for customers -- that's a bad failure path.

Good post-mortem anyway, and I'd say their customers are a good deal happier to see this published, even if it contains details of the mistakes they made along the way.
redis  caching  storage  networking  network-partitions  twilio  postmortems  ops  billing  replication 
july 2013 by jm

Copy this bookmark: