jm + papers + redundancy   2

Large-scale cluster management at Google with Borg
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.
We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.

(via Conall)
via:conall  clustering  google  papers  scale  to-read  borg  cluster-management  deployment  packing  reliability  redundancy 
april 2015 by jm
Locally Repairable Codes
Facebook’s new erasure coding algorithm (via High Scalability).
Repair disk I/O and repair network traffic were roughly halved compared to RS (Reed-Solomon) codes.
The LRC required 14% more storage than RS (i.e. a storage overhead of 60% of the data size).
Repair times were much lower thanks to the local repair codes.
Much greater reliability thanks to fast repairs.
Reduced network traffic makes them suitable for geographic distribution.
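A rough sketch of where the figures above could come from. The parameters are assumptions, not stated in this note: 10 data blocks with 4 global parities for RS, plus 2 stored local parities (one per group of 5 data blocks) for the LRC. The local parity here is a plain XOR, a simplification for illustration and not Facebook's actual implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Assumed (hypothetical) parameters, chosen to match the percentages quoted above.
DATA, RS_PARITY, LOCAL_PARITY, GROUP = 10, 4, 2, 5

rs_overhead = RS_PARITY / DATA                    # parity as a fraction of data, RS
lrc_overhead = (RS_PARITY + LOCAL_PARITY) / DATA  # parity as a fraction of data, LRC
# Extra storage of LRC relative to RS: total blocks stored, LRC vs RS.
extra_storage = (DATA + RS_PARITY + LOCAL_PARITY) / (DATA + RS_PARITY) - 1

# Toy data: 10 blocks of 4 bytes each, split into two local groups of 5.
data = [bytes([i] * 4) for i in range(DATA)]
groups = [data[i:i + GROUP] for i in range(0, DATA, GROUP)]
local_parities = [xor_blocks(g) for g in groups]

def repair(lost_index):
    """Rebuild one lost data block from its local group only.

    Returns the rebuilt block and the number of blocks that had to be read.
    """
    g, pos = divmod(lost_index, GROUP)
    survivors = [b for j, b in enumerate(groups[g]) if j != pos]
    reads = survivors + [local_parities[g]]
    return xor_blocks(reads), len(reads)

rebuilt, blocks_read = repair(3)
print(f"RS overhead {rs_overhead:.0%}, LRC overhead {lrc_overhead:.0%}, "
      f"extra storage {extra_storage:.0%}, blocks read for repair: {blocks_read}")
```

Under these assumptions a single-block repair reads 5 blocks from the local group, where a standard RS decode with 10 data blocks would read 10, which is consistent with the halved repair I/O and network traffic.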
erasure-coding  facebook  redundancy  repair  algorithms  papers  via:highscalability  data  storage  fault-tolerance 
june 2013 by jm
