mpm + fault-tolerance   32

Shuffle Sharding: Massive and Magical Fault Isolation
Shuffle Sharding is a general-purpose technique, and you can also choose to Shuffle Shard across many kinds of resources, including pure in-memory data-structures such as queues, rate-limiters, locks and other contended resources.
fault-tolerance  load-balancing 
8 weeks ago by mpm
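A minimal sketch of the idea, for illustration only: each customer is mapped to a small, stable, pseudo-random subset of a resource pool (here, in-memory queues), so one misbehaving customer can only affect the few resources in its own shard. The names `shuffle_shard`, `customer_id`, and `shard_size` are hypothetical, and hashing the customer ID to seed the selection is an assumption, not something the linked article prescribes.

```python
import hashlib
import random


def shuffle_shard(customer_id, resources, shard_size):
    """Return a stable pseudo-random subset of `resources` for one customer."""
    # Derive a deterministic seed from the customer ID so the same customer
    # always lands on the same shard (assumed scheme, for illustration).
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    # Pick `shard_size` distinct resources; sorted only for readable output.
    return sorted(rng.sample(resources, shard_size))


queues = [f"queue-{i}" for i in range(8)]
shard_a = shuffle_shard("customer-a", queues, 2)
shard_b = shuffle_shard("customer-b", queues, 2)
print(shard_a, shard_b)
```

With 8 queues and shards of size 2 there are 28 possible shards, so two customers are unlikely to share both of their queues.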
HydrOS Project
HydrOS is a multikernel research operating system written in Erlang to withstand complete software failure as well as partial hardware failure. HydrOS uses the multikernel model to separate a multicore computer into a set of individual computing nodes, each capable of withstanding the failure of the others.
fault-tolerance  reliability  erlang 
december 2017 by mpm
Efficient and Modular Consensus-Free Reconfiguration for Fault-Tolerant Storage
Quorum systems are useful tools for implementing consistent and available storage in the presence of failures. These systems usually comprise a static set of servers that provide a fault-tolerant read/write register accessed by a set of clients. We consider a dynamic variant of these systems and propose FreeStore, a set of fault-tolerant protocols that emulates a register in dynamic asynchronous systems in which processes are able to join/leave the servers set during the execution. These protoco...
consensus  replication  storage  fault-tolerance 
july 2016 by mpm
BeeHive: An Efficient Fault-Tolerant Routing Algorithm Inspired by Honey Bee
Bees organize their foraging activities as a social and communicative effort, indicating the direction, distance and quality of food sources to their fellow foragers through a "dance" inside the bee hive (on the "dance floor"). In this paper we present a novel routing algorithm, BeeHive, which has been inspired by the communicative and evaluative methods and procedures of honey bees. In this algorithm, bee agents travel through network regions called foraging zones. On their way their information on the network state is delivered for updating the local routing tables. BeeHive is fault tolerant, scalable, and relies completely on local, or regional, information, respectively. We demonstrate through extensive simulations that BeeHive achieves a similar or better performance compared to state-of-the-art algorithms.
fault-tolerance  networking  bio-inspired  bees 
june 2016 by mpm
Self-Healing Protocols for Connectivity Maintenance in Unstructured Overlays
In this paper, we discuss the use of self-organizing protocols to improve the reliability of dynamic Peer-to-Peer (P2P) overlay networks. Two similar approaches are studied, which are based on local knowledge of the nodes' 2nd neighborhood. The first scheme is a simple protocol requiring interactions among nodes and their direct neighbors. The second scheme adds a check on the Edge Clustering Coefficient (ECC), a local measure that allows determining edges connecting different clusters in the network. The performed simulation assessment evaluates these protocols over uniform networks, clustered networks and scale-free networks. Different failure modes are considered. Results demonstrate the effectiveness of the proposal.
overlay  fault-tolerance 
july 2015 by mpm
All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
We present the first comprehensive study of application-level crash-consistency protocols built atop modern file systems. We find that applications use complex update protocols to persist state, and that the correctness of these protocols is highly dependent on subtle behaviors of the underlying file system, which we term persistence properties. We develop a tool named BOB that empirically tests persistence properties, and use it to demonstrate that these properties vary widely among six popular Linux file systems. We build a framework named ALICE that analyzes application update protocols and finds crash vulnerabilities, i.e., update protocol code that requires specific persistence properties to hold for correctness. Using ALICE, we analyze eleven widely-used systems (including databases, key-value stores, version control systems, distributed systems, and virtualization software) and find a total of 60 vulnerabilities, many of which lead to severe consequences. We also show that ALICE can be used to evaluate the effect of new filesystem designs on application-level consistency.
filesystem  fault-tolerance  integrity 
november 2014 by mpm
Blockade is a utility for testing network failures and partitions in distributed applications. Blockade uses Docker containers to run application processes and manages the network from the host system to create various failure scenarios
testing  networking  fault-tolerance 
february 2014 by mpm
Partitions.tcl is a small Tcl program to simulate network partitions among a set of real computers
networking  testing  fault-tolerance 
december 2013 by mpm
The network is reliable
much of what we know about the failure modes of real-world distributed systems is founded on guesswork and rumor. Sysadmins and developers will swap stories over beers, but detailed, public postmortems and comprehensive surveys of network availability are few and far between. In this post, we’d like to bring a few of these stories together. We believe this is a first step towards a more open and honest discussion of real-world partition behavior, and, ultimately, more robust distributed systems design
networking  availability  consistency  reliability  fault-tolerance  outage 
june 2013 by mpm
Viewstamped Replication Revisited
This paper presents an updated version of Viewstamped Replication, a replication technique that handles failures in which nodes crash. It describes how client requests are handled, how the group reorganizes when a replica fails, and how a failed replica is able to rejoin the group. The paper also describes a number of important optimizations and presents a protocol for handling reconfigurations that can change both the group membership and the number of failures the group is able to handle.
consensus  consistency  fault-tolerance  availability 
june 2013 by mpm
The purpose of this FAQ is to explain what is known about CAP, so as to help those new to the theorem get up to speed quickly, and to settle some common misconceptions or points of disagreement
availability  consistency  fault-tolerance 
may 2013 by mpm
HAT, not CAP: Highly Available Transactions
To provide high availability, many scalable data stores abandon traditional database functionality, often offering operations limited to single objects (or groups of co-located objects) with limited consistency. However, many applications benefit from transactions, or larger units of arbitrary combinations of multiple operations on multiple objects. While the CAP theorem is often interpreted to preclude the availability of transactions in a partition-prone environment, we show that highly available systems can provide transactional guarantees matching the majority of today's ACID databases. We propose Highly Available Transactions (HATs) that support many desirable semantic guarantees for arbitrary transactional sequences of read and write operations, execute with low latency, and remain available during partitions
availability  fault-tolerance 
february 2013 by mpm
Dynamic Voting for Consistent Primary Components
The dynamic voting paradigm allows such systems to define quorums adaptively, accounting for the changes in the set of participants. Furthermore, dynamic voting was proven to be the most available paradigm for maintaining quorums in unreliable networks. However, the subtleties of implementing dynamic voting were not well understood; in fact, many of the suggested protocols may lead to inconsistencies in case of failures.
consensus  consistency  fault-tolerance 
december 2012 by mpm
Hystrix for Resilience Engineering
In a distributed environment, failure of any given service is inevitable. Hystrix is a library designed to control the interactions between these distributed services providing greater tolerance of latency and failure. Hystrix does this by isolating points of access between the services, stopping cascading failures across them, and providing fallback options, all of which improve the system's overall resiliency.
fault-tolerance  reliability  availability 
december 2012 by mpm
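As a rough illustration of the "stop cascading failures and fall back" idea described above, here is a generic circuit-breaker sketch. It is not Hystrix's API (Hystrix is a Java library); the class name `CircuitBreaker` and the parameters `failure_threshold`, `reset_timeout`, and `clock` are hypothetical and chosen only for this example.

```python
import time


class CircuitBreaker:
    """Generic circuit breaker with a fallback (illustrative, not Hystrix)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the remote call entirely and use the fallback.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency in its own breaker, for example `breaker.call(fetch_profile, lambda: DEFAULT_PROFILE)`, where both names are placeholders.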
Berkeley Cloud Seminar
Presentations on various distributed & cloudy topics
distributed  availability  fault-tolerance  consistency 
april 2012 by mpm
A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques
base  consistency  fault-tolerance 
april 2012 by mpm
Boxwood: Abstractions as the Foundation for Storage Infrastructure
We have built a system called Boxwood to explore the feasibility and utility of providing high-level abstractions or data structures as the fundamental storage infrastructure
distributed  consensus  datastructure  storage  fault-tolerance 
march 2012 by mpm
Paxos Made Moderately Complex
This paper provides imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing various implementation details.
paxos  consistency  consensus  distributed  fault-tolerance  leader-election 
march 2012 by mpm
The SMART way to migrate replicated stateful services
This paper describes SMART, a new technique for changing the set of machines where such a service runs, i.e., migrating the service
fault-tolerance  availability  consensus  consistency 
march 2012 by mpm
Efficient Replica Maintenance for Distributed Storage Systems
This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially lead to copying a server's worth of data over the Internet to maintain replication levels.
storage  availability  fault-tolerance 
march 2012 by mpm
Lower Bounds for Asynchronous Consensus
Impossibility results and best-case lower bounds are proved for the number of message delays and the number of processes required to reach agreement in an asynchronous consensus algorithm that tolerates non-Byzantine failure
consensus  availability  consistency  distributed  fault-tolerance 
march 2012 by mpm
Diskless Paxos crash recovery
An algorithm for Paxos crash-recovery that does not require persistent storage, by utilizing synchronized clocks and a lattice-based epoch numbering.
paxos  distributed  fault-tolerance 
august 2011 by mpm
A Fault-Tolerant Token based Atomic Broadcast Algorithm
This paper presents the first token based atomic broadcast algorithm that uses an unreliable failure detector instead of a group membership service
fault-tolerance  alm  overlay 
june 2011 by mpm
Network Awareness and Failure Resilience In Self-Organising Overlay Networks
In this paper, we propose an algorithm called the localiser which addresses these three key challenges. The localiser refines the overlay in a way that reflects geographic locality so as to reduce network load. Simultaneously, it helps to evenly balance the number of neighbours of each node in the overlay, thereby sharing the load evenly as well as improving the resilience to random node failures or disconnections.
overlay  fault-tolerance 
june 2011 by mpm
Don’t Lose Your ets Tables
Crashing a process when something unexpected occurs is perfectly fine, since coding defensively introduces problems of its own, but you can still avoid losing your ets tables like this relatively easily
erlang  fault-tolerance 
march 2011 by mpm
ACMS: The Akamai Configuration Management System
In this paper we discuss the design and implementation of a configuration management system for the Akamai Network. It allows reliable yet highly asynchronous delivery of configuration information, is significantly fault-tolerant, and can scale if necessary to hundreds of thousands of servers
cm  deployment  distributed  fault-tolerance 
february 2011 by mpm
Injecting Errors for Fun and Profit
Error-detection and correction features are only as good as our ability to test them
fault-removal  fault-tolerance 
august 2010 by mpm
Gimli is a crash tracing/analysis framework
fault-tolerance  fault-removal  deployment  availability 
july 2010 by mpm
Did you just tell me to go fuck myself?
fun  fault-tolerance  erlang  hadoop 
july 2009 by mpm
