highavailability   656


October 21 post-incident analysis | Hacker News
The timeline of events was interesting (and much appreciated), but the root cause analysis doesn't really go much deeper than "we had a brief network partition, and our systems weren't designed to cope with it", which still leaves a whole lot of question marks.
Of course, without detailed knowledge of how GitHub's internals work, all we can do is speculate. But just based on what was explained in this blog post, it sounds like they're replicating database updates asynchronously, without waiting for the updates to be acknowledged by slaves before the master allows them to commit. Which means the data on slaves is always slightly out-of-date, and becomes more out-of-date when the slaves are partitioned from the master. Which means that promoting a slave to master will by definition lose some committed writes.

If "guarding the confidentiality and integrity of user data is GitHub’s highest priority", then why would they build and deploy an automated failover system whose purpose is to preserve availability at the cost of consistency? And why were they apparently caught off-guard when it operated as designed?

(Reading point 1 under "technical initiatives", it seems that they consider intra-DC failover to be "safe", and cross-DC failover to be "unsafe". But the exact same failure mode is present in both cases; the only difference is the length of the time during which in-flight writes can be lost.)
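
The failure mode the comment describes can be shown with a toy model. The sketch below is not GitHub's system; the `Primary` and `Replica` classes and the `lost_on_failover` helper are hypothetical names invented for illustration. It assumes a single primary with one replica and compares asynchronous replication (acknowledge first, ship later) with synchronous replication (acknowledge only after the replica has the write) when a partition is followed by a promotion.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    log: list = field(default_factory=list)


@dataclass
class Primary:
    replica: Replica
    synchronous: bool
    log: list = field(default_factory=list)
    unshipped: list = field(default_factory=list)
    partitioned: bool = False

    def commit(self, write: str) -> bool:
        """Return True if the write is acknowledged to the client."""
        if self.synchronous:
            if self.partitioned:
                # Replica unreachable: refuse the write instead of
                # acknowledging something the replica has not seen.
                return False
            self.log.append(write)
            self.replica.log.append(write)
            return True
        # Asynchronous: acknowledge immediately, replicate later.
        self.log.append(write)
        self.unshipped.append(write)
        return True

    def ship(self) -> None:
        """Send pending writes to the replica if the network allows it."""
        if not self.partitioned:
            self.replica.log.extend(self.unshipped)
            self.unshipped.clear()


def lost_on_failover(synchronous: bool) -> list:
    """Return acknowledged writes that are missing after promoting the replica."""
    replica = Replica()
    primary = Primary(replica=replica, synchronous=synchronous)
    acked = []

    for w in ["w1", "w2"]:          # healthy network
        if primary.commit(w):
            acked.append(w)
    primary.ship()

    primary.partitioned = True      # partition begins
    for w in ["w3", "w4"]:
        if primary.commit(w):
            acked.append(w)
    primary.ship()                  # no effect while partitioned

    # Failover: the replica becomes the new primary.
    return [w for w in acked if w not in replica.log]


print(lost_on_failover(synchronous=False))
print(lost_on_failover(synchronous=True))
```

In this model the asynchronous run loses every write acknowledged during the partition, while the synchronous run loses none because it rejects those writes. That is the availability-versus-consistency trade the comment is pointing at, and the number of lost writes grows with the length of the partition rather than depending on whether the replica is in the same data center.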
github  scale  outage  hackernews  microsoft  mysql  ha  highavailability 
24 days ago by dentarg
Getting The Airlines Back On Their Feet After A Disaster | Information Security Buzz
On the importance of disaster recovery, in addition to high availability, as part of resilient service design.
resilienceengineering  architecture  highavailability  disasterrecovery 
12 weeks ago by cleskowsky
