distributed-systems   2441

Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems
We highlight one often-overlooked cause of performance failure: limpware – “limping” hardware whose performance degrades significantly compared to its specification. We report anecdotes of degraded disks and network components seen in large-scale production. To measure the system-level impact of limpware, we assembled limpbench, a set of benchmarks that combine data-intensive load and limpware injections. We benchmark five cloud systems (Hadoop, HDFS, ZooKeeper, Cassandra, and HBase) and find that limpware can severely impact distributed operations, nodes, and an entire cluster. From this, we introduce the concept of limplock, a situation where a system progresses slowly due to the presence of limpware and is not capable of failing over to healthy components. We show how each cloud system that we analyze can exhibit operation, node, and cluster limplock. We conclude that many cloud systems are not limpware tolerant.
distributed-systems  papers 
12 days ago by foodbaby
Working with Asynchronous Celery Tasks – lessons learned - Added August 14, 2018 at 02:31PM
celery  distributed-systems  python  read2of 
23 days ago by xenocid
Understanding Blockchain Fundamentals, Part 1: Byzantine Fault Tolerance
Understanding Blockchain Fundamentals, Part 1: Byzantine Fault Tolerance - Added June 19, 2018 at 11:57AM
blockchain  distributed-systems  read2of 
29 days ago by xenocid
Serf by HashiCorp
Serf is a decentralized solution for cluster membership, failure detection, and orchestration. Lightweight and highly available.
clustering  messaging  cluster  distributed-systems  decentralization  devops  devtools  discovery  distributed  google 
4 weeks ago by vrobin
Protocol aware recovery for consensus based storage
Within a replicated state machine system, there are three critical persistent data structures: the log, the snapshots, and the metainfo. The log maintains the history of commands, snapshots are used to allow garbage collection of the log and prevent it from growing indefinitely, and the metainfo contains critical metadata such as the log start index. Any of these could be corrupted due to storage faults. None of the current approaches analysed by the authors could correctly recover from such faults.
the-morning-paper  distributed-systems  software-development  adrian-colyer  computer-science  raft  paxos 
4 weeks ago by chriskrycho
