jm + stripe   9

Fast and flexible observability with canonical log lines
Interesting -- basically crossing the line between service metrics and logging, with a simple, readable structured logging format, and a well-defined structure
stripe  logging  metrics  canonical-logs  structured-logs  ops  operability  observability 
july 2019 by jm
The secret life of DNS packets: investigating complex networks
ah, the joys of building production systems atop AWS and its poorly-documented limits:
One more surprising detail we discovered in the tcpdump data was that the VPC resolver was not sending back responses to many of the queries. During one of the 60-second collection periods the DNS server sent 257,430 packets to the VPC resolver. The VPC resolver replied back with only 61,385 packets, which averages to 1,023 packets per second. We realized we may be hitting the AWS limit for how much traffic can be sent to a VPC resolver, which is 1,024 packets per second per interface.
aws  limits  ops  stripe  vpc  dns  outages  postmortems  debugging 
may 2019 by jm
Reproducible research: Stripe’s approach to data science
This is intriguing -- using Jupyter notebooks to embody data analysis work, and ensure it's reproducible, which brings better rigour similarly to how unit tests improve coding. I must try this.
Reproducibility makes data science at Stripe feel like working on GitHub, where anyone can obtain and extend others’ work. Instead of islands of analysis, we share our research in a central repository of knowledge. This makes it dramatically easier for anyone on our team to work with our data science research, encouraging independent exploration.

We approach our analyses with the same rigor we apply to production code: our reports feel more like finished products, research is fleshed out and easy to understand, and there are clear programmatic steps from start to finish for every analysis.
stripe  coding  data-science  reproducability  science  jupyter  notebooks  analysis  data  experiments 
november 2016 by jm
Service discovery at Stripe
Writeup of their Consul-based service discovery system, a bit similar to smartstack. Good description of the production problems that they saw with Consul too, and also they figured out that strong consistency isn't actually what you want in a service discovery system ;)

HN comments are good too: https://news.ycombinator.com/item?id=12840803
consul  api  microservices  service-discovery  dns  load-balancing  l7  tcp  distcomp  smartstack  stripe  cap-theorem  scalability 
november 2016 by jm
Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support
There was a breakdown in communication between the developer who requested the index migration and the database operator who deleted the old index. Instead of working on the migration together, they communicated in an implicit way through flawed tooling. The dashboard that surfaced the migration request was missing important context: the reason for the requested deletion, the dependency on another index’s creation, and the criticality of the index for API traffic. Indeed, the database operator didn’t have a way to check whether the index had recently been used for a query.


Good demo of how the Etsy-style chatops deployment approach would have helped avoid this risk.
stripe  postmortem  outages  databases  indexes  deployment  chatops  deploy  ops 
october 2015 by jm
Scaling email transparency
This is quite interesting/weird -- Stripe's protocol for mass-CCing email as they scale up the company, based around http://en.wikipedia.org/wiki/Civil_inattention
communication  culture  email  management  stripe  cc  transparency  civil-inattention 
december 2014 by jm
Game Day Exercises at Stripe: Learning from `kill -9`
We’ve started running game day exercises at Stripe. During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node, and ended up losing all data in the cluster. We were very surprised by this, but grateful to have found the problem in testing. This result and others from this exercise convinced us that game days like these are quite valuable, and we would highly recommend them for others.


Excellent post. Game days are a great idea. Also: massive Redis clustering fail
game-days  redis  testing  stripe  outages  ops  kill-9  failover 
october 2014 by jm
Big, Small, Hot or Cold - Your Data Needs a Robust Pipeline
'(Examples [of big-data B-I crunching pipelines] from Stripe, Tapad, Etsy & Square)'
stripe  tapad  etsy  square  big-data  analytics  kafka  impala  hadoop  hdfs  parquet  thrift 
february 2014 by jm

Copy this bookmark:



description:


tags: