ops   8459


Stack Overflow: How We Do Monitoring - 2018 Edition
interesting to see how the other half lives, as Stack Overflow is a .NET shop
logging  monitoring  stack-overflow  dotnet  ops  metrics 
22 hours ago by jm
Incident Response: Trade-offs Under Pressure
John Allspaw provides a glimpse into how other fields handle incident response, including active steps companies can take to support engineers in those uncertain and ambiguous scenarios. Examples include fields such as military, surgical trauma units, space transportation, aviation and air traffic control, and wildland firefighting.
response  incident  allspaw  ops 
4 days ago by hynek
Production Guideline
Checklists of things to check before deploying.
ops  releasemanagement  cloudnative  configurationmanagement 
6 days ago by cote
How Complex Systems Fail
This short paper by Richard Cook outlines the factors that contribute to failure of complex systems. Much of what is considered best practice by the operations teams of today can be linked to the ideas presented here.
paper  complexity  tech  techarch  ops 
7 days ago by billglover
Overload Control for Scaling WeChat Microservices
DAGOR is the load shedding strategy that has been employed in production at WeChat since 2013. What makes DAGOR interesting is that it is designed specifically to maintain user experience and fairness in times of systems overload. Traditional load shedding mechanisms have focussed on using a gateway to shed load at the edge. But these either ignore the dependent nature of requests in micro-service architectures or require expensive coordination across instances to track requests across services.

> "In this paper, we propose DAGOR, an overload control scheme designed for the account-oriented microservice architecture. DAGOR is service agnostic and system-centric. It manages overload at the microservice granule such that each microservice monitors its load status in real time and triggers load shedding in a collaborative manner among its relevant services when overload is detected."

The key conclusions are worth bearing in mind when considering load shedding for your architectures:

* Overload control in the large-scale microservice architecture must be decentralised and autonomous in each service.
* The algorithmic design of overload control should take into account a variety of feedback mechanisms, rather than relying solely on the open-loop heuristics.
* An effective design of overload control is always derived from the comprehensive profiling of the processing behavior in the actual workload.
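
To make the first two conclusions a little more concrete, here is a minimal Python sketch of a service-local admission controller. It is not DAGOR itself: the class name, the 20 ms queuing-time limit, the window size and the single integer priority are all assumptions picked for illustration. It only shows the general shape of feedback-driven shedding where each service decides on its own, with no coordination across instances.

```python
from dataclasses import dataclass, field


@dataclass
class AdmissionController:
    """Hypothetical per-service load shedder driven by local queuing time."""

    queue_time_limit_ms: float = 20.0  # assumed overload threshold
    window_size: int = 100             # assumed number of samples per window
    max_priority: int = 10             # assumed scale: lower number = more important
    admit_level: int = 10              # requests with priority <= this are admitted
    _samples: list = field(default_factory=list)

    def admit(self, priority: int) -> bool:
        """Decide locally whether to serve or shed a request."""
        return priority <= self.admit_level

    def record_queue_time(self, queue_time_ms: float) -> None:
        """Feed back how long a request waited before being processed."""
        self._samples.append(queue_time_ms)
        if len(self._samples) >= self.window_size:
            self._adjust()

    def _adjust(self) -> None:
        """Tighten admission when the window looks overloaded, relax otherwise."""
        average = sum(self._samples) / len(self._samples)
        self._samples.clear()
        if average > self.queue_time_limit_ms:
            self.admit_level = max(0, self.admit_level - 1)
        else:
            self.admit_level = min(self.max_priority, self.admit_level + 1)
```

A service would call `admit()` on each incoming request and `record_queue_time()` as requests leave its queue. The thresholds would need to come from profiling the real workload, which is the point of the third conclusion.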

I've not seen much detail on the systems architecture behind some of the Chinese tech giants, but this paper also gives a little insight into the services architecture at WeChat.

Direct link to PDF: [socc18-final100.pdf](https://www.cs.columbia.edu/~ruigu/papers/socc18-final100.pdf)
Commentary: [the morning paper](https://blog.acolyer.org/2018/11/16/overload-control-for-scaling-wechat-microservices/)
techarch  tech  scalability  ops 
7 days ago by billglover
Google - Site Reliability Engineering
A great guide to getting started making prod engineering tractable and fun. Talks about concerns, desired outcomes well ...
insightful  mikedickerson  monitoring  observability  ops  devops  monthly 
8 days ago by cleskowsky
Founder Connie Brenton Resigns From CLOC, Citing 'Different Directions' | Legaltech News
The founder of the Corporate Legal Operations Consortium (CLOC), an organization that provides education and networking opportunities for legal operations professionals, has resigned.

Connie Brenton, the senior director of legal operations at NetApp Inc. and CLOC’s founder and now-former chairman of the board, resigned Wednesday. Executive team and board of directors member Jeff Franke, who was the assistant general counsel, legal operations, at the company formerly known as Yahoo Inc., also resigned from CLOC.
“This organization was put together with the passion and the commitment and the vision, the combined vision of all of the participants. And it has taken participants from the entire legal ecosystem,” Brenton said. “However, we are at a point now, we’re exactly three years old, we’re moving in different directions now. The board is more interested in moving the organization to a caretaker role versus that dynamic and growing organization and that isn’t as much fun for me.”
10 days ago by JordanFurlong
glibc changed their UTF-8 character collation ordering across versions, breaking postgres
This is terrifying:
Streaming replicas—and by extension, base backups—can become dangerously broken when the source and target machines run slightly different versions of glibc. Particularly, differences in strcoll and strcoll_l leave "corrupt" indexes on the slave. These indexes are sorted out of order with respect to the strcoll running on the slave. Because postgres is unaware of the discrepancy, it uses these "corrupt" indexes to perform merge joins; merges rely heavily on the assumption that the indexes are sorted and this causes all the results of the join past the first poison pill entry to not be returned. Additionally, if the slave becomes master, the "corrupt" indexes will in cases be unable to enforce uniqueness, but quietly allow duplicate values.

Moral of the story -- keep your libc versions in sync across storage replication sets!
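
A toy Python illustration of the failure mode, assuming two made-up ordering functions as stand-ins for two glibc versions (neither reflects what any real `strcoll` does). An "index" built under one ordering is no longer sorted under the other, so a binary search that trusts the sort order misses a row that is actually present:

```python
def old_collation(s):
    """Stand-in for the ordering on the primary: plain code point order."""
    return s


def new_collation(s):
    """Stand-in for the ordering on the replica: case-insensitive first."""
    return (s.lower(), s)


def is_sorted(values, key):
    """True if values are in non-decreasing order under the given ordering."""
    keys = [key(v) for v in values]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def index_lookup(values, target, key):
    """Binary search that assumes values are sorted under the given ordering."""
    lo, hi = 0, len(values)
    wanted = key(target)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(values[mid]) < wanted:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(values) and values[lo] == target


# The index is built on the primary, then shipped byte-for-byte to the replica.
index = sorted(["apple", "Banana", "cherry", "Date"], key=old_collation)

found_on_primary = index_lookup(index, "apple", old_collation)
found_on_replica = index_lookup(index, "apple", new_collation)
```

Here `found_on_primary` is true and `found_on_replica` is false, even though both searched the same data. Nothing errors out, which is what makes this class of bug so unpleasant.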
postgresql  scary  ops  glibc  collation  utf-8  characters  indexing  sorting  replicas  postgres 
11 days ago by jm
FFmpeg, SOX, Pandoc and RSVG for AWS Lambda
OK-ish way to add dependencies to your Lambda containers:
The basic AWS Lambda container is quite constrained, and until recently it was relatively difficult to include additional binaries into Lambda functions. Lambda Layers make that easy. A Layer is a common piece of code that is attached to your Lambda runtime in the /opt directory. You can reuse it in many functions, and deploy it only once. Individual functions do not need to include the layer code in their deployment packages, which means that the resulting functions are smaller and deploy faster. For example, at MindMup, we use Pandoc to convert markdown files into Word documents. The actual lambda function code is only a few dozen lines of JavaScript, but before layers, each deployment of the function had to include the whole Pandoc binary, larger than 100 MB. With a layer, we can publish Pandoc only once, so we use significantly less overall space for Lambda function versions. Each code change now requires just a quick redeployment.
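
A rough sketch of what using such a layer might look like from a Python handler. The binary location under `/opt`, the event field name and the output path are assumptions for illustration, not taken from the MindMup layers; check where your layer actually unpacks its files.

```python
import subprocess
from pathlib import Path

# Assumed location of the binary once the layer is attached under /opt.
PANDOC = Path("/opt/bin/pandoc")


def build_command(source_path: str, output_path: str) -> list:
    """Build the pandoc invocation; the output format follows the file extension."""
    return [str(PANDOC), source_path, "-o", output_path]


def handler(event, context):
    """Hypothetical Lambda entry point: convert a Markdown file to a Word document."""
    source_path = event["input_path"]  # hypothetical event field
    output_path = "/tmp/output.docx"   # /tmp is the writable scratch space in Lambda
    subprocess.run(build_command(source_path, output_path), check=True)
    return {"output_path": output_path}
```

The function package itself contains only this file; the large binary lives in the layer and is shared across functions.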
serverless  lambda  dependencies  deployment  packaging  ops 
13 days ago by jm
AWS Service SLAs
The goal of this page is to highlight the lack of coverage AWS provides for its services across different security factors. These limitations are not well-understood by many. Further, the "Y" fields are meant to indicate that this service has any capability for the relevant factor. In many cases, this is not full coverage for the service, or there are exceptions or special cases.
amazon  aws  services  slas  ops  reliability 
14 days ago by jm
"Stop Rate Limiting! Capacity Management Done Right" by Jon Moore
You’d be hard pushed to deploy some APIs without talking about rate-limiting. A standard feature on all API gateways, many operations teams will require it as a pre-condition for going live. But is rate limiting the best way to manage access to limited capacity?

In this talk, Jon Moore carefully sets out the case for an alternative approach: stop rate limiting and start limiting concurrency instead. Interestingly, he outlines an approach for doing this in a distributed fashion without the need for coordination across instances of an API gateway.
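
For a feel of the difference, here is a minimal single-process sketch of a concurrency limit in Python. It is not the distributed scheme from the talk, and the class and exception names are made up. By Little's Law the number of in-flight requests is arrival rate times latency, so a fixed rate limit lets more work pile up as a backend slows down, while a cap on in-flight requests does not.

```python
import threading
from contextlib import contextmanager


class Overloaded(Exception):
    """Raised when no capacity is available; a gateway might map this to a 503."""


class ConcurrencyLimiter:
    """Caps the number of requests in flight rather than requests per second."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self):
        # Reject immediately instead of queueing when every slot is taken.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many requests in flight")
        try:
            yield
        finally:
            self._slots.release()
```

A request handler would wrap its backend call in `with limiter.slot():` and translate `Overloaded` into a fast rejection.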
talk  performance  ops 
14 days ago by billglover
