mpm + consistency   86

Designing an Efficient Replicated Log Store with Consensus Protocol
Highly available and high-performance message logging system is critical building block for various use cases that require global ordering, especially for deterministic distributed transactions. To achieve availability, we maintain multiple replicas that have the same payloads in exactly the same order. This introduces various challenging issues such as consistency between replicas after failure, while minimizing performance degradation. Replicated state machine-based consensus protocols are the most suitable candidates to fulfill those requirements, but double-write problem and different logging granularity make it hard to keep the system efficient. This paper suggests a novel way to build a replicated log store on top of Raft consensus protocol, aiming at providing the same level of consistency as well as fault-tolerance without sacrificing the throughput of the system.
consensus  consistency  replication  storage  database 
4 weeks ago by mpm
Determining whether online users are authorized to access digital objects is central to preserving privacy. This paper presents the design, implementation, and deployment of Zanzibar, a global system for storing and evaluating access control lists. Zanzibar provides a uniform data model and configuration language for expressing a wide range of access control policies from hundreds of client services at Google, including Calendar, Cloud, Drive, Maps, Photos, and YouTube. Its authorization decisions respect causal ordering of user actions and thus provide external consistency amid changes to access control lists and object contents. Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use.
consistency  authorization  integrity 
7 weeks ago by mpm
Distributed Transactional Systems Cannot Be Fast
We prove that no fully transactional system can provide fast read transactions (including read-only ones that are considered the most frequent in practice). Specifically, to achieve fast read transactions, the system has to give up support of transactions that write more than one object. We prove this impossibility result for distributed storage systems that are causally consistent, i.e., they do not require to ensure any strong form of consistency. Therefore, our result holds also for any system that ensures a consistency level stronger than causal consistency, e.g., strict serializability. The impossibility result holds even for systems that store only two objects (and support at least two servers and at least four clients). It also holds for systems that are partially replicated. Our result justifies the design choices of state-of-the-art distributed transactional systems and insists that system designers should not put more effort to design fully-functional systems that support both fast read transactions and ensure causal or any stronger form of consistency.
performance  database  consistency 
8 weeks ago by mpm
On mixing eventual and strong consistency: Bayou revisited
In this paper we study the properties of eventually consistent distributed systems that feature arbitrarily complex semantics and mix eventual and strong consistency. These systems execute requests in a highly-available, weakly-consistent fashion, but also enable stronger guarantees through additional inter-replica synchronization mechanisms that require the ability to solve distributed consensus. We use the seminal Bayou system as a case study, and then generalize our findings to a whole class of systems. We show dubious and unintuitive behaviour exhibited by those systems and provide a theoretical framework for reasoning about their correctness. We also state an impossibility result that formally proves the inherent limitation of such systems, namely temporary operation reordering, which admits interim disagreement between replicas on the relative order in which the client requests were executed.
consistency  replication 
9 weeks ago by mpm
Efficient, Consistent Distributed Computation with Predictive Treaties
To achieve good performance, modern applications often par-tition their state across multiple geographically distributednodes. While this approach reduces latency in the commoncase, it can be challenging for programmers to use correctly,especially in applications that require strong consistency. Weintroducepredictive treaties, a mechanism that can signifi-cantly reduce distributed coordination without losing strongconsistency. The central insight behind our approach is thatmany computations can be expressed in terms of predicatesover distributed state that can be partitioned and enforcedlocally. Predictive treaties improve on previous work by al-lowing the locally enforced predicates to depend on time.Intuitively, by predicting the evolution of system state, coor-dination can be significantly reduced compared to static ap-proaches. We implemented predictive treaties in a distributedsystem that exposes them in an intuitive programming model.We evaluate performance on several benchmarks, includingTPC-C, showing that predictive treaties can significantly in-crease performance by orders of magnitude and can evenoutperform customized algorithms.
consistency  performance  coordination 
9 weeks ago by mpm
It’s Time to Move on from Two Phase Commit
In my opinion we need to remove veto power from workers and architect systems in which the system does not have freedom to abort a transaction
database  protocol  consistency 
may 2019 by mpm
How to do distributed locking
The algorithm claims to implement fault-tolerant distributed locks (or rather, leases) on top of Redis, and the page asks for feedback from people who are into distributed systems. The algorithm instinctively set off some alarm bells in the back of my mind, so I spent a bit of time thinking about it and writing up these notes.
consistency  concurrency 
april 2019 by mpm
Ensuring referential integrity under causal consistency
Referential integrity (RI) is an important correctness property of a shared, distributed object storage system. It is sometimes thought that enforcing RI requires a strong form of consistency. In this paper, we argue that causal consistency suffices to maintain RI. We support this argument with pseudocode for a reference CRDT data type that maintains RI under causal consistency. QuickCheck has not found any errors in the model.
consistency  causal 
december 2018 by mpm
Distributed transactional reads: the strong, the quick, the fresh & the impossible
This paper studies the costs and trade-offs of providing transactional consistent reads in a distributed storage system. We identify the following dimensions: read consistency, read delay (latency), and data freshness. We show that there is a three-way trade-off between them, which can be summarised as follows: (i) it is not possible to ensure at the same time order-preserving (e.g., causally-consistent) or atomic reads, Minimal Delay, and maximal freshness; thus, reading data that is the most fresh without delay is possible only in a weakly-isolated mode; (ii) to ensure atomic or order-preserving reads at Minimal Delay imposes to read data from the past (not fresh); (iii) however, order-preserving minimal-delay reads can be fresher than atomic; (iv) reading atomic or order-preserving data at maximal freshness may block reads or writes indefinitely. Our impossibility results hold independently of other features of the database, such as update semantics (totally ordered or not) or data model (structured or unstructured). Guided by these results, we modify an existing protocol to ensure minimal-delay reads (at the cost of freshness) under atomic-visibility and causally-consistent semantics. Our experimental evaluation supports the theoretical results.
consistency  latency 
december 2018 by mpm
The FuzzyLog: A Partially Ordered Shared Log
The FuzzyLog is a partially ordered shared log abstraction. Distributed applications can concurrently append to the partial order and play it back. FuzzyLog applications obtain the benefits of an underlying shared log – extracting strong consistency, durability, and failure atomicity in simple ways – without suffering from its drawbacks. By exposing a partial order, the FuzzyLog enables three key capabilities for applications: linear scaling for throughput and capacity (without sacrificing atomicity), weaker consistency guarantees, and tolerance to network partitions. We present Dapple, a distributed implementation of the FuzzyLog abstraction that stores the partial order compactly and supports efficient appends / playback via a new ordering protocol. We implement several data structures and applications over the FuzzyLog, including several map variants as well as a ZooKeeper implementation. Our evaluation shows that these applications are compact, fast, and flexible: they retain the simplicity (100s of lines of code) and strong semantics (durability and failure atomicity) of a shared log design while exploiting the partial order of the Fuzzy-Log for linear scalability, flexible consistency guarantees (e.g., causal+ consistency), and network partition tolerance. On a 6-node Dapple deployment, our FuzzyLog- based ZooKeeper supports 3M/sec single-key writes, and 150K/sec atomic cross-shard renames.
database  consistency  storage 
december 2018 by mpm
Partitioned consensus and its impact on Spanner’s latency
I noticed that there appears to be a general assumption amongst my readers that unified consensus systems must have higher latency than partitioned consensus systems. In this post, I want to clear up some of the misconceptions and inaccurate assumptions around these latency tradeoffs, and present a deeper (and technical) analysis on how these different approaches to consensus have surprisingly broad consequences on transaction latency. We will analyze the latency tradeoff from three perspectives: (1) Latency for write transactions, (2) Latency for linearizable read-only transactions and (3) Latency for serializable snapshot transactions.
consensus  latency  consistency 
december 2018 by mpm
Liberating Distributed Consensus
Three example algorithms which achieve consensus in 1 round trip and tolerate any minority failure
paxos  consistency  consensus 
november 2018 by mpm
LogDevice is a scalable distributed log storage system that offers durability, high availability, and total order of records under failures. It is designed for a variety of workloads including event streaming, replication pipelines, transaction logs, and deferred work journals.
consistency  storage 
september 2018 by mpm
The FuzzyLog: Partially Ordered Shared Log
The FuzzyLog is a partially ordered shared log. Unlike traditional SMR systems, such as Paxos or Tango, which store all events in a single total order, the FuzzyLog allows the storage and update of partially ordered histories. This relaxation of ordering contraints enables richer application semantics around consistency guarentees, data partitioning and log-playback, while retaining the ease-of-programming of the shared-log model.
consistency  storage  database  datastructure 
august 2018 by mpm
Session Guarantees with Raft and Hybrid Logical Clocks
Eventual consistency is a popular consistency model for geo-replicated data stores. Although eventual consistency provides high performance and availability, it can cause anomalies that make programming complex for application developers. Session guarantees can remove some of these anomalies while causing much lower overhead compared with stronger consistency models. In this paper, we provide a protocol for providing session guarantees for NuKV, a key-value store developed for services with very high availability and performance requirements at eBay. NuKV relies on the Raft protocol for replication inside datacenters, and uses eventual consistency for replication among datacenters. We provide modified versions of conventional session guarantees to avoid the problem of slowdown cascades in systems with large numbers of partitions. We also use Hybrid Logical Clocks to eliminate the need for delaying write operations to satisfy session guarantees. Our experiments show that our protocol provides session guarantees with a negligible overhead when compared with eventual consistency.
august 2018 by mpm
Consistency Models
This map (adapted from Bailis, Davidson, Fekete et al and Viotti & Vukolic) shows the relationships between common consistency models for concurrent systems.
july 2018 by mpm
Just-Right Consistency: reconciling availability and safety
By the CAP Theorem, a distributed data storage system can ensure either Consistency under Partition (CP) or Availability under Partition (AP), but not both. This has led to a split between CP databases, in which updates are synchronous, and AP databases, where they are asynchronous. However, there is no inherent reason to treat all updates identically: simply, the system should be as available as possible, and synchronised just enough for the application to be correct. We offer a principled Just-Right Consistency approach to designing such applications, reconciling correctness with availability and performance, based on the following insights:(i) The Conflict-Free Replicated Data Type (CRDTs) data model supports asynchronous updates in an intuitive and principled way.(ii) Invariants involving joint or mutually-ordered updates are compatible with AP and can be guaranteed by Transactional Causal Consistency, the strongest consistency model that does not compromise availability. Regarding the remaining, "CAP-sensitive" invariants:(iii) For the common pattern of Bounded Counters, we provide encapsulated data type that is proven correct and is efficient; (iv) in the general case, static analysis can identify when synchronisation is not necessary for correctness.Our Antidote cloud database system supports CRDTs, Transactional Causal Consistency and the Bounded Counter data type. Support tools help design applications by static analysis and proof of CAP-sensitive invariants. This system supports industrial-grade applications and has been tested experimentally with hundreds of servers across several geo-distributed data centres.
consistency  availability 
february 2018 by mpm
Verifying Strong Eventual Consistency in Distributed Systems
Data replication is used in distributed systems to maintain up-to-date copies of shared data across multiple computers in a network. However, despite decades of research, algorithms for achieving consistency in replicated systems are still poorly understood. Indeed, many published algorithms have later been shown to be incorrect, even some that were accompanied by supposed mechanised proofs of correctness. In this work, we focus on the correctness of Conflict-free Replicated Data Types (CRDTs), a class of algorithm that provides strong eventual consistency guarantees for replicated data. We develop a modular and reusable framework in the Isabelle/HOL interactive proof assistant for verifying the correctness of CRDT algorithms. We avoid correctness issues that have dogged previous mechanised proofs in this area by including a network model in our formalisation, and proving that our theorems hold in all possible network behaviours. Our axiomatic network model is a standard abstraction that accurately reflects the behaviour of real-world computer networks. Moreover, we identify an abstract convergence theorem, a property of order relations, which provides a formal definition of strong eventual consistency. We then obtain the first machine-checked correctness theorems for three concrete CRDTs: the Replicated Growable Array, the Observed-Remove Set, and an Increment-Decrement Counter. We find that our framework is highly reusable, developing proofs of correctness for the latter two CRDTs in a few hours and with relatively little CRDT-specific code.
consistency  crdt  testing 
august 2017 by mpm
DottedDB: Anti-Entorpy without Merkle Trees, Deletes without Tombstones
To achieve high availability in the face of network partitions, many distributed databases adopt eventual consistency, allow temporary conflicts due to concurrent writes, and use some form of per-key logical clock to detect and resolve such conflicts. Furthermore, nodes synchronize periodically to ensure replica convergence in a process called anti-entropy, normally using Merkle Trees. We present the design of DottedDB, a Dynamo-like key-value store, which uses a novel nodewide logical clock framework, overcoming three fundamental limitations of the state of the art: (1) minimize the metadata per key necessary to track causality, avoiding its growth even in the face of node churn; (2) correctly and durably delete keys, with no need for tombstones; (3) offer a lightweight antientropy mechanism to converge replicated data, avoiding the need for Merkle Trees. We evaluate DottedDB against MerkleDB, an otherwise identical database, but using per-key logical clocks and Merkle Trees for anti-entropy, to precisely measure the impact of the novel approach. Results show that: causality metadata per object always converges rapidly to only one id-counter pair; distributed deletes are correctly achieved without global coordination and with constant metadata; divergent nodes are synchronized faster, with less memory-footprint and with less communication overhead than using Merkle Trees.
database  coordination  consistency 
august 2017 by mpm
A Publish/Subscribe System Using Causal Broadcast Over Dynamically Built Spanning Trees
In this paper we present VCube-PS, a topic-based Publish/Subscribe system built on the top of a virtual hypercube-like topology. Membership information and published messages to subscribers (members) of a topic group are broadcast over dynamically built spanning trees rooted at the message's source. For a given topic, delivery of published messages respects causal order. Performance results of experiments conducted on the PeerSim simulator confirm the efficiency of VCube-PS in terms of scalability, latency, number, and size of messages when compared to a single rooted, not dynamically, tree built approach
alm  protocol  broadcast  causal  consistency 
july 2017 by mpm
A prototype of a Dynamo-style distributed key-value database, implementing Server Wide Clocks as the main causality mechanism across the system.
database  consistency 
june 2017 by mpm
Saturn: a Distributed Metadata Service for Causal Consistency
This paper presents the design, implementation, and evaluation of Saturn, a metadata service for geo-replicated systems. Saturn can be used in combination with several distributed and replicated data services to ensure that remote operations are made visible in an order that respects causality, a requirement central to many consistency criteria.
causal  consistency  replication 
april 2017 by mpm
Spanner vs. Calvin
I found it very difficult to find cases where an ideal implementation of Spanner theoretically outperforms an ideal implementation of Calvin.
consistency  consensus  database  performance  scalability 
april 2017 by mpm
Consistency in Non-Transactional Distributed Storage Systems
Over the years, different meanings have been associated to the word consistency in the distributed systems community. While in the '80s "consistency" typically meant strong consistency, later defined also as linearizability, in recent years, with the advent of highly available and scalable systems, the notion of "consistency" has been at the same time both weakened and blurred. In this paper we aim to fill the void in literature, by providing a structured and comprehensive overview of different consistency notions that appeared in distributed systems, and in particular storage systems research, in the last four decades. We overview more than 50 different consistency notions, ranging from linearizability to eventual and weak consistency, defining precisely many of these, in particular where the previous definitions were ambiguous. We further provide a partial order among different consistency predicates, ordering them by their semantic "strength", which we believe will reveal useful in future research. Finally, we map the consistency semantics to different practical systems and research prototypes.
january 2017 by mpm
Eventually Consistent Transactions
We propose a novel consistency model based on eventually consistent transactions. Unlike serializable transactions, eventually consistent transactions are ordered by two order relations (visibility and arbitration) rather than a single order relation
consistency  database 
july 2016 by mpm
Linearizability versus Serializability
Linearizability and serializability are both important properties about interleavings of operations in databases and distributed systems, and it’s easy to get them confused. This post gives a short, simple, and hopefully practical overview of the differences between the two.
december 2015 by mpm
Eventually Consistent Register Revisited
In order to converge in the presence of concurrent updates, modern eventually consistent replication systems rely on causality information and operation semantics. It is relatively easy to use semantics of high-level operations on replicated data structures, such as sets, lists, etc. However, it is difficult to exploit semantics of operations on registers, which store opaque data. In existing register designs, concurrent writes are resolved either by the application, or by arbitrating them according to their timestamps. The former is complex and may require user intervention, whereas the latter causes arbitrary updates to be lost. In this work, we identify a register construction that generalizes existing ones by combining runtime causality ordering, to identify concurrent writes, with static data semantics, to resolve them. We propose a simple conflict resolution template based on an application-predefined order on the domain of values. It eliminates or reduces the number of conflicts that need to be resolved by the user or by an explicit application logic. We illustrate some variants of our approach with use cases, and how it generalizes existing designs
consistency  crdt 
november 2015 by mpm
Exactly-Once Quantity Transfer
This paper addresses a specific problem: the exactly-once transfer of a “quantity” from one node to another on an unreliable network (coping with message duplication, loss, or reordering) and without any form of global synchronization. This allows preserving a global property (the sum of quantities remains unchanged) without requiring global linearizability and only through using pairwise interactions between nodes, thereforeallowing partitions in the system
consistency  distributed  crdt 
november 2015 by mpm
Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail
The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect for understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The issue of observing distributed computations in a causally consistent way and the basic problems of detecting global predicates are discussed. To illustrate the major difficulties, some typical monitoring and debugging approaches are assessed, and it is demonstrated how their feasibility is severely limited by the fundamental problem to master the complexity of causal relationships
october 2015 by mpm
Putting Consistency Back into Eventual Consistency
Geo-replicated storage systems are at the core of current Internet services. The designers of the replication protocols used by these systems must choose between either supporting low-latency, eventually-consistent operations, or ensuring strong consistency to ease application correctness. We propose an alternative consistency model, Explicit Consistency, that strengthens eventual consistency with a guarantee to preserve specific invariants defined by the applications. Given these application-specific invariants, a system that supports Explicit Consistency identifies which operations would be unsafe under concurrent execution, and allows programmers to select either violation-avoidance or invariant-repair techniques. We show how to achieve the former, while allowing operations to complete locally in the common case, by relying on a reservation system that moves coordination off the critical path of operation execution. The latter, in turn, allows operations to execute without restriction, and restore invariants by applying a repair operation to the database state. We present the design and evaluation of Indigo, a middleware that provides Explicit Consistency on top of a causally-consistent data store. Indigo guarantees strong application invariants while providing similar latency to an eventually-consistent system in the common case
consistency  availability 
may 2015 by mpm
Pistachio: co-locate the data and compute for fastest cloud compute
Pistachio is a distributed key value store system. Data can be replicated with n replicas with strong consistency guarantees. Up to (n-1) failures can be tolerated. It’s being used as the user profile storage for large scale ads serving products within Yahoo. 10+ billions of user profiles are being stored with ~2 million reads QPS, 0.8GB/s read throughput and ~0.5 million writes QPS, 0.3GB/s write throughput. Average latency is under 1ms. It guarantees strong consistency and fault-tolerance. We have hundreds of servers in 8 data centers all over the globe supporting hundreds of millions in revenue.
dht  storage  consistency 
may 2015 by mpm
Salt: Combining ACID and BASE in a Distributed Database
This paper presents Salt, a distributed database that allows developers to improve the performance and scalability of their ACID applications through the incremental adoption of the BASE approach. Salt’s motivation is rooted in the Pareto principle: for many applications, the transactions that actually test the performance limits of ACID are few. To leverage this insight, Salt introduces BASE transactions, a new abstraction that encapsulates the workflow of performance-critical transactions. BAS...
consistency  database  base  transactions 
october 2014 by mpm
Building global and scalable systems with Atomic Multicast
The rise of worldwide Internet-scale services demands large distributed systems. Indeed, when handling several millions of users, it is common to operate thousands of servers spread across the globe. Here, replication plays a central role, as it contributes to improve the user experience by hiding failures and by providing acceptable latency. In this paper, we claim that atomic multicast, with strong and well-defined properties, is the appropriate abstraction to efficiently design and implement globally scalable distributed systems. We substantiate our claim with the design of two modern online services atop atomic multicast, a strongly consistent key-value store and a distributed log. In addition to presenting the design of these services, we experimentally assess their performance in a geographically distributed deployment
consistency  performance  scalability 
july 2014 by mpm
Making Operation-based CRDTs Operation-based
Conflict-free Replicated Datatypes (CRDT) can simplify the design of eventually consistent systems. They can be classi ed into state-based or operation-based. Operation-based designs have the potential for allowing very compact solutions in both the sent messages and the object state size. Unfortunately, the current approaches are still far from this objective. In this paper, we introduce a new `pure' operation-based framework that makes the design and the implementation of these CRDTs more simple and ecient. We show how to leverage the meta-data of the messaging middleware to design very compact CRDTs, while only disseminating operation names and their optional arguments.
crdt  consistency  replication 
march 2014 by mpm
Scalable and Accurate Causality Tracking for Eventually Consistent Stores
We propose a new logical clock mechanism and a logical clock framework that together support a traditional key-value store API, while capturing causality in an accurate and scalable way, avoiding false conflicts. It maintains concise information per data replica, only linear on the number of replica servers, and allows data replicas to be compared and merged linear with the number of replica servers and versions.
march 2014 by mpm
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere at any time and only pay for what they use and store. In WAS, data is stored durably using both local and geographic replication to facilitate disaster recovery. Currently, WAS storage comes in the form of Blobs (files), Tables (structured storage), and Queues (message delivery). In this paper, we describe the WAS architecture, global namespace, and data model, as well as its resource provisioning, load balancing, and replication systems.
consistency  availability  storage 
march 2014 by mpm
Replicated Data Types: Specification, Verification, Optimality
Geographically distributed systems often rely on replicated eventually consistent data stores to achieve availability and performance. To resolve conflicting updates at different replicas, researchers and practitioners have proposed specialized consistency protocols, called replicated data types, that implement objects such as registers, counters, sets or lists. Reasoning about replicated data types has however not been on par with comparable work on abstract data types and concurrent data types, lacking specifications, correctness proofs, and optimality results.

To fill in this gap, we propose a framework for specifying replicated data types using relations over events and verifying their implementations using replication-aware simulations. We apply it to 7 existing implementations of 4 data types with nontrivial conflict-resolution strategies and optimizations (last-writer-wins register, counter, multi-value register and observed-remove set). We also present a novel technique for obtaining lower bounds on the worst-case space overhead of data type implementations and use it to prove optimality of 4 implementations. Finally, we show how to specify consistency of replicated stores with multiple objects axiomatically, in analogy to prior work on weak memory models. Overall, our work provides foundational reasoning tools to support research on replicated eventually consistent stores.
crdt  consistency  replication 
january 2014 by mpm
Session Guarantees for Weakly Consistent Replicated Data
Four per-session guarantees are proposed to aid users and applications of weakly consistent replicated data: Read Your Writes, Monotonic Reads, Writes Follow Reads, and Monotonic Writes. The intent is to present individual applications with a view of the database that is consistent with their own actions, even if they read and write from various, potentially inconsistent servers. The guarantees can be layered on existing systems that employ a read-any/ write-any replication scheme while retaining the principal benefits of such a scheme, namely high-availability, simplicity, scalability, and support for disconnected operation. These session guarantees were developed in the context of the Bayou project at Xerox PARC in which we are designing and building a replicated storage system to support the needs of mobile computing users who may be only intermittently connected.
december 2013 by mpm
Vive la Diff'erence: Paxos vs. Viewstamped Replication vs. Zab
Paxos, Viewstamped Replication, and Zab are replication protocols that ensure high-availability in asynchronous environments with crash failures. Various claims have been made about similarities and differences between these protocols. But how does one determine whether two protocols are the same, and if not, how significant the differences are? We propose to address these questions using refinement mappings, where protocols are expressed as succinct specifications that are progressively refined to an implementable protocol. Doing so enables a principled understanding of their correctness, and it provides clear guidelines to implement the protocols correctly. Additionally, comparing Paxos, Viewstamped Replication, and Zab using this approach allowed us to identify key differences that have a significant impact on performance
paxos  consistency 
september 2013 by mpm
A highly-available key value store for shared configuration and service discovery. etcd is inspired by zookeeper and doozer
consistency  dht  paxos 
july 2013 by mpm
Non-Monotonic Snapshot Isolation
Many distributed applications require transactions. However, transactional protocols that require strong synchronization are costly in large scale environments. Two properties help with scalability of a transactional system: genuine partial replication (GPR), which leverages the intrinsic parallelism of a workload, and snapshot isolation (SI), which decreases the need for synchronization. We show that, under standard assumptions (data store accesses are not known in advance, and transactions may access arbitrary objects in the data store), it is impossible to have both SI and GPR. To circumvent this impossibility, we propose a weaker consistency criterion, called Non-monotonic Snapshot Isolation (NMSI). NMSI retains the most important properties of SI, i.e., read-only transactions always commit, and two write-conflicting updates do not both commit. We present a GPR protocol that ensures NMSI, and has lower message cost (i.e., it contacts fewer replicas and/or commits faster) than previous approaches.
consistency  availability  database 
july 2013 by mpm
Scalable Eventually Consistent Counters over Unreliable Networks
Counters are an important abstraction in distributed computing, and play a central role in large scale geo-replicated systems, counting events such as web page impressions or social network "likes". Classic distributed counters, strongly consistent, cannot be made both available and partition-tolerant, due to the CAP Theorem, being unsuitable to large scale scenarios. This paper defines Eventually Consistent Distributed Counters (ECDC) and presents an implementation of the concept, Handoff Counters, that is scalable and works over unreliable networks. By giving up the sequencer aspect of classic distributed counters, ECDC implementations can be made AP in the CAP design space, while retaining the essence of counting. Handoff Counters are the first CRDT (Conflict-free Replicated Data Type) based mechanism that overcomes the identity explosion problem in naive CRDTs, such as G-Counters (where state size is linear in the number of independent actors that ever incremented the counter), by managing identities towards avoiding global propagation and garbage collecting temporary entries. The approach used in Handoff Counters is not restricted to counters, being more generally applicable to other data types with associative and commutative operations
consistency  crdt 
july 2013 by mpm
Analyzing Consistency Properties for Fun and Profit
Motivated by the increasing popularity of eventually consistent key-value stores as a commercial service, we address two important problems related to the consistency properties in a history of operations on a read/write register (i.e., the start time, finish time, argument, and response of every operation). First, we consider how to detect a consistency violation as soon as one happens. To this end, we formulate a specification for online verification algorithms, and we present such algorithms for several well-known consistency properties. Second, we consider how to quantify the severity of the violations, if a history is found to contain consistency violations. We investigate two quantities: one is the staleness of the reads, and the other is the commonality of violations. For staleness, we further consider time- based staleness and operation-count-based staleness. We present efficient algorithms that compute these quantities. We believe that addressing these problems helps both key-value store providers and users adopt data consistency as an important aspect of key-value store offerings
june 2013 by mpm
The network is reliable
much of what we know about the failure modes of real-world distributed systems is founded on guesswork and rumor. Sysadmins and developers will swap stories over beers, but detailed, public postmortems and comprehensive surveys of network availability are few and far between. In this post, we’d like to bring a few of these stories together. We believe this is a first step towards a more open and honest discussion of real-world partition behavior, and, ultimately, more robust distributed systems design
networking  availability  consistency  reliability  fault-tolerance  outage 
june 2013 by mpm
Viewstamped Replication Revisited
This paper presents an updated version of Viewstamped Replication, a replication technique that handles failures in which nodes crash. It describes how client requests are handled, how the group reorganizes when a replica fails, and how a failed replica is able to rejoin the group. The paper also describes a number of important optimizations and presents a protocol for handling reconfigurations that can change both the group membership and the number of failures the group is able to handle.
consensus  consistency  fault-tolerance  availability 
june 2013 by mpm
The purpose of this FAQ is to explain what is known about CAP, so as to help those new to the theorem get up to speed quickly, and to settle some common misconceptions or points of disagreement
availability  consistency  fault-tolerance 
may 2013 by mpm
Improving Logical Clocks in Riak with Dotted Version Vectors: A Case Study
Major web applications need the partition-tolerance and availability of the CAP theorem for scalability purposes, thus some adopt the eventual consistent model, which sacrifices consistency. These systems must handle data divergence and conflicts that have to be carefully accounted for. Some systems have tried to use classic Version Vectors to track causality, but these reveal either scalability problems or loss of accuracy if pruning is used to prevent growth. Dotted Version Vectors is a mechanism that deals with data versioning in eventual consistent systems, which allows accurate causality tracking and scalability, both in the number of clients and servers, while limiting vector size to replication degree. However, theories can abstract too much of the hiding properties which difficult the implementation. We discuss the challenges faced when implementing Dotted Version Vectors in Riak- a distributed key-value database-, evaluate its behavior and performance, discuss the tradeoffs made and provide further optimizations.
consistency  time 
may 2013 by mpm
Dynamic Reconfiguration of Primary/Backup Clusters
Dynamically changing (reconfiguring) the membership of a replicated distributed system while preserving data consistency and system availability is a challenging problem. In this paper, we show that reconfiguration can be simplified by taking advantage of certain properties commonly provided by Primary/Backup systems. We describe a new reconfiguration protocol, recently implemented in Apache Zookeeper. It fully automates configuration changes and minimizes any interruption in service to clients while maintaining data consistency. By leveraging the properties already provided by Zookeeper our protocol is considerably simpler than state of the art.
zookeeper  consistency  availability 
march 2013 by mpm
Understanding Eventual Consistency
Modern geo-replicated databases underlying large-scale Internet services guarantee immediate availability and tolerate network partitions at the expense of providing only weak forms of consistency, commonly dubbed eventual consistency. At the moment there is a lot of confusion about the semantics of eventual consistency, as different systems implement it with different sets of features and in subtly different forms, stated either informally or using disparate and low-level formalisms. We address this problem by proposing a framework for formal and declarative specification of the semantics of eventually consistent systems using axioms. Our framework is fully customizable: by varying the set of axioms, we can rigorously define the semantics of systems that combine any subset of typical guarantees or features, including conflict resolution policies, session guarantees, causality guarantees, multiple consistency levels and transactions. We prove that our specifications are validated by an example abstract implementation, based on algorithms used in real-world systems. These results demonstrate that our framework provides system architects with a tool for exploring the design space, and lays the foundation for formal reasoning about eventually consistent systems
march 2013 by mpm
On Consistency of Operational Transformation Approach
The Operational Transformation (OT) approach, used in many collaborative editors, allows a group of users to concurrently update replicas of a shared object and exchange their updates in any order. The basic idea of this approach is to transform any received update operation before its execution on a replica of the object. This transformation aims to ensure the convergence of the different replicas of the object, even though the operations are executed in different orders. However, designing transformation functions for achieving convergence is a critical and challenging issue. Indeed, the transformation functions proposed in the literature are all revealed incorrect.
In this paper, we investigate the existence of transformation functions for a shared string altered by insert and delete operations. From the theoretical point of view, two properties - named TP1 and TP2 - are necessary and sufficient to ensure convergence. Using controller synthesis technique, we show that there are some transformation functions which satisfy only TP1 for the basic signatures of insert and delete operations. As a matter of fact, it is impossible to meet both properties TP1 and TP2 with these simple signatures
march 2013 by mpm
Intra-cluster Replication in Apache Kafka
In the upcoming version 0.8 release, Kafka will support intra-cluster replication, which increases both the availability and the durability of the system. In the following post, I will give an overview of Kafka's replication design
messaging  consistency 
february 2013 by mpm
Understanding Eventual Consistency
At the moment there is a lot of confusion about the semantics of eventual consistency, as different systems implement it with different sets of features and in subtly different forms, stated either informally or using disparate and low-level formalisms. We address this problem by proposing a framework for formal and declarative specification of the semantics of eventually consistent systems using axioms
consistency  base 
january 2013 by mpm
Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary
First, we propose RedBlue consistency, which enables blue operations to be fast (and eventually consistent) while the remaining red operations are strongly consistent (and slow). Second, to make use of fast operation whenever possible and only resort to strong consistency when
needed, we identify conditions delineating when operations can be blue and must be red. Third, we introduce a method that increases the space of potential blue operations by breaking them into separate generator and shadow phases
january 2013 by mpm
Engagements: Building Eventually ACiD Business Transactions
Eventually ACiD Business Transactions are a pattern of long running work comprising many individual “classic” transactions spread across many disparate systems. These systems are independent (autonomous) and are frequently implemented as a set of replicas that will sometimes give confusing answers.

We will propose a messaging and state management abstraction called engagements which we argue can lower the challenges faced in this somewhat chaotic life in the real world.
consistency  base 
january 2013 by mpm
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based
messaging  consistency  protocol 
january 2013 by mpm
Dynamic Voting for Consistent Primary Components
The dynamic voting paradigm allows such systems to define quorums adaptively, accounting for the changes in the set of participants. Rrrthermore, dynamic voting was proven to be the most available paradigm for maintaining quorums in unreliable networks. However, the subtleties of implementing dynamic voting were not well understood; in fact, many of the suggested protocols may lead to inconsistencies in caseof failurs.
consensus  consistency  fault-tolerance 
december 2012 by mpm
IEEE Computer issue on the CAP Theorem
Your question about R+W > N implying consistency is a common misconception, so I'm glad you brought it up
december 2012 by mpm
All about Two-Phase Locking and a little bit MVCC
In this blog I will describe the concurrency control methods implemented in database management systems, and the differences between them
december 2012 by mpm
On Transaction Liveness in Replicated Databases
This paper makes a first attempt to give a precise characterisation of liveness in replicated database systems
consistency  availability 
november 2012 by mpm
MDCC: Multi-Data Center Consistency
In the MDCC work (Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, March 2012), the authors describe an "optimistic commit protocol, that does not require a master or partitioning, and is strongly consistent at a cost similar to eventually consistent protocols". Optimistic and strongly-consistent is an odd/unlikely couple
consistency  paxos 
november 2012 by mpm
A generic gossip based state replication and failure detection service. The network transport is UDP unicast
java  consistency  base  gossip 
november 2012 by mpm
An optimized conflict-free replicated set
We present one of the existing conflict-free replicated data types, Observed-Remove Set. Furthermore, in order to decrease the size of meta-data, we propose a new optimization to avoid tombstones. This approach that can be transposed to other data types, such as maps, graphs or sequences
crdt  consistency 
october 2012 by mpm
Transaction storage for geo-replicated systems
A key feature behind Walter is a new property called Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously, while providing strong guarantees within each site. PSI precludes write-write conflicts, so that developers need not worry about conflict-resolution logic
consistency  database 
october 2012 by mpm
Calvin: Fast Distributed Transactions for Partitioned Database Systems
By replicating transaction inputs rather than effects, Calvin is also able to support multiple consistency levels—including Paxos based strong consistency across geographically distant replicas—at no cost to transactional throughput.
consistency  database  replication 
july 2012 by mpm
Granola: Low-Overhead Distributed Transaction Coordination
This paper presents Granola, a transaction coordination infrastructure for building reliable distributed storage applications. Granola provides a strong consistency model, while significantly reducing transaction coordination overhead
database  consistency 
july 2012 by mpm
File system on CRDT
In this report we show how to manage a distributed hierarchical structure representing a file system. This structure is optimistically replicated, each user work on his local replica, and updates are sent to other replica. The different replicas eventually observe same view of file systems. At this stage, conflicts between updates are very common. We claim that conflict resolution should rely as little as possible on users. In this report we propose a simple and modular solution to resolve these problems and maintain data consistency
consistency  base  crdt 
july 2012 by mpm
Avout brings Clojure's in-memory model of state to distributed application development by providing a distributed implementation of Clojure's Multiversion Concurrency Control (MVCC) STM along with distributable, durable, and extendable versions of Clojure's Atom and Ref concurrency primitives
clojure  zookeeper  consistency 
july 2012 by mpm
A comprehensive study of Convergent and Commutative Replicated Data Types
Eventual consistency aims to ensure that replicas of some mutable shared object converge without foreground synchronisation. Previous approaches to eventual consistency are ad-hoc and error-prone. We study a principled approach: to base the design of shared data types on some simple formal conditions that are sufficient to guarantee eventual consistency. We call these types Convergent or Commutative Replicated Data Types (CRDTs)
distributed  datastructure  consistency  base 
may 2012 by mpm
Convergent Replicated Data Types
datastructure  consistency  base 
may 2012 by mpm
Don't Settle for Eventual: Scalable Causal Consistency for Wide-area Storage with COPS
Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an "always-on" experience where operations always complete with low latency. Today's systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model—causal consistency with convergent conflict handling, or causal+—that is the strongest achieved under these constraints
distributed  consistency  time 
april 2012 by mpm
Probabilistically Bounded Staleness
Instead of relying on anecdotal evidence, we can quantitatively demonstrate why eventual consistency is "good enough" for many users. We can predict the expected consistency of an eventually consistent data store using models we've developed, called Probabilistically Bounded Staleness
distributed  consistency  base  safety 
april 2012 by mpm
Achieving high-throughput State Machine Replication in multi-core systems
In this work, we show how to architect a replicated state machine whose performance scales with the number of cores in the nodes. We do so by applying several good practices of concurrent programming to the specific case of state machine replication, including staged execution, workload partitioning, actors, and non-blocking data structures.
distributed  consistency  paxos  performance  scalability 
april 2012 by mpm
Berkeley Cloud Seminar
Presentations on various distributed & cloudy topics
distributed  availability  fault-tolerance  consistency 
april 2012 by mpm
A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques
base  consistency  fault-tolerance 
april 2012 by mpm
Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore
Spinnaker is an experimental datastore that is designed to run on a large cluster of commodity servers in a single datacenter. It features key-based range partitioning, 3-way replication, and a transactional get-put API with the option to choose either strong or timeline consistency on reads
paxos  consistency  availability 
april 2012 by mpm
MDCC: Multi-Data Center Consistency
A new commit protocol and programming model for efficiently achieving strong consistency in databases across data centers
consistency  distributed 
april 2012 by mpm
Paxos Made Moderately Complex
This paper provides imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing various implementation details.
paxos  consistency  consensus  distributed  fault-tolerance  leader-election 
march 2012 by mpm
The SMART way to migrate replicated stateful services
This paper describes SMART, a new technique for changing the set of machines where such a service runs, i.e., migrating the service
fault-tolerance  availability  consensus  consistency 
march 2012 by mpm
Lower Bounds for Asynchronous Consensus
Impossibility results and best-case lower bounds are proved for the number of message delays and the number of processes required to reach agreement in an asynchronous consensus algorithm that tolerates non-Byzantine failure
consensus  availability  consistency  distributed  fault-tolerance 
march 2012 by mpm
« earlier      
per page:    204080120160

Copy this bookmark: