mpm + storage   109

Designing an Efficient Replicated Log Store with Consensus Protocol
Highly available and high-performance message logging system is critical building block for various use cases that require global ordering, especially for deterministic distributed transactions. To achieve availability, we maintain multiple replicas that have the same payloads in exactly the same order. This introduces various challenging issues such as consistency between replicas after failure, while minimizing performance degradation. Replicated state machine-based consensus protocols are the most suitable candidates to fulfill those requirements, but double-write problem and different logging granularity make it hard to keep the system efficient. This paper suggests a novel way to build a replicated log store on top of Raft consensus protocol, aiming at providing the same level of consistency as well as fault-tolerance without sacrificing the throughput of the system.
consensus  consistency  replication  storage  database 
4 weeks ago by mpm
SSS: Scalable Key-Value Store with External Consistent and Abort-free Read-only Transactions
We present SSS, a scalable transactional key-value store deploying a novel distributed concurrency control that provides external consistency for all transactions, never aborts read-only transactions due to concurrency, all without specialized hardware. SSS ensures the above properties without any centralized source of synchronization. SSS's concurrency control uses a combination of vector clocks and a new technique, called snapshot-queuing, to establish a single transaction serialization order that matches the order of transaction completion observed by clients. We compare SSS against high performance key-value stores, Walter, ROCOCO, and a two-phase commit baseline. SSS outperforms 2PC-baseline by as much as 7x using 20 nodes; and ROCOCO by as much as 2.2x with long read-only transactions using 15 nodes.
storage  database 
january 2019 by mpm
The FuzzyLog: A Partially Ordered Shared Log
The FuzzyLog is a partially ordered shared log abstraction. Distributed applications can concurrently append to the partial order and play it back. FuzzyLog applications obtain the benefits of an underlying shared log – extracting strong consistency, durability, and failure atomicity in simple ways – without suffering from its drawbacks. By exposing a partial order, the FuzzyLog enables three key capabilities for applications: linear scaling for throughput and capacity (without sacrificing atomicity), weaker consistency guarantees, and tolerance to network partitions. We present Dapple, a distributed implementation of the FuzzyLog abstraction that stores the partial order compactly and supports efficient appends / playback via a new ordering protocol. We implement several data structures and applications over the FuzzyLog, including several map variants as well as a ZooKeeper implementation. Our evaluation shows that these applications are compact, fast, and flexible: they retain the simplicity (100s of lines of code) and strong semantics (durability and failure atomicity) of a shared log design while exploiting the partial order of the Fuzzy-Log for linear scalability, flexible consistency guarantees (e.g., causal+ consistency), and network partition tolerance. On a 6-node Dapple deployment, our FuzzyLog- based ZooKeeper supports 3M/sec single-key writes, and 150K/sec atomic cross-shard renames.
database  consistency  storage 
december 2018 by mpm
LogDevice is a scalable distributed log storage system that offers durability, high availability, and total order of records under failures. It is designed for a variety of workloads including event streaming, replication pipelines, transaction logs, and deferred work journals.
consistency  storage 
september 2018 by mpm
The FuzzyLog: Partially Ordered Shared Log
The FuzzyLog is a partially ordered shared log. Unlike traditional SMR systems, such as Paxos or Tango, which store all events in a single total order, the FuzzyLog allows the storage and update of partially ordered histories. This relaxation of ordering contraints enables richer application semantics around consistency guarentees, data partitioning and log-playback, while retaining the ease-of-programming of the shared-log model.
consistency  storage  database  datastructure 
august 2018 by mpm
Log Pruning in Distributed Event-sourced Systems
Event sourcing is increasingly used and implemented in event-based systems for maintaining the evolution of application state. However, unbounded event logs are impracticable for many systems, as it is difficult to align scalability requirements and long-term runtime behavior with the corresponding storage requirements. To this end, we explore the design space of log pruning approaches suitable for event-sourced systems. Furthermore, we survey specific log pruning mechanisms for event-sourced logs. In a brief evaluation, we point out the trade-offs when applying pruning to event logs and highlight the applicability of log pruning to event-sourced systems.
data  event  storage 
july 2018 by mpm
MySQL 8.0: New Lock free, scalable WAL design
The Write Ahead Log (WAL) is one of the most important components of a database. All the changes to data files are logged in the WAL (called the redo log in InnoDB). This allows to postpone the moment when the modified pages are flushed to disk, still protecting from data losses.
database  concurrency  storage  non-blocking 
july 2018 by mpm
libzbc: ZBC device manipulation library
libzbc is a simple library providing functions for manipulating disks supporting the Zoned Block Command (ZBC) and Zoned-device ATA command set (ZAC) disks.
june 2018 by mpm
Algorithms Behind Modern Storage Systems
This article takes a closer look at two storage system design approaches used in a majority of modern databases—read-optimized B-trees and write-optimized LSM (log-structured merge)-trees—and describes their use cases and tradeoffs.
storage  datastructure 
june 2018 by mpm
Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems
The ever-growing amount of data requires highly scalable storage solutions. The most flexible approach is to use storage pools that can be expanded and scaled down by adding or removing storage devices. To make this approach usable, it is necessary to provide a solution to locate data items in such a dynamic environment. This article presents and evaluates the Random Slicing strategy, which incorporates lessons learned from table-based, rule-based, and pseudo-randomized hashing strategies and is able to provide a simple and efficient strategy that scales up to handle exascale data. Random Slicing keeps a small table with information about previous storage system insert and remove operations, drastically reducing the required amount of randomness while delivering a perfect load distribution.
storage  scalability 
june 2018 by mpm
UnQLite is a in-process software library which implements a self-contained, serverless, zero-configuration, transactional NoSQL database engine. UnQLite is a document store database similar to MongoDB, Redis, CouchDB etc. as well a standard Key/Value store similar to BerkeleyDB, LevelDB, etc.
storage  database 
january 2018 by mpm
PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
Key-value stores such as LevelDB and RocksDB offer excellent write throughput, but suffer high write amplification. The write amplification problem is due to the Log-Structured Merge Trees data structure that underlies these key-value stores. To remedy this problem, this paper presents a novel data structure that is inspired by Skip Lists, termed Fragmented Log-Structured Merge Trees (FLSM). FLSM introduces the notion of guards to organize logs, and avoids rewriting data in the same level. We build PebblesDB, a highperformance key-value store, by modifying HyperLevelDB to use the FLSM data structure. We evaluate PebblesDB using micro-benchmarks and show that for write-intensive workloads, PebblesDB reduces write amplification by 2.4-3× compared to RocksDB, while increasing write throughput by 6.7×. We modify two widely-used NoSQL stores, MongoDB and HyperDex, to use PebblesDB as their underlying storage engine. Evaluating these applications using the YCSB benchmark shows that throughput is increased by 18-105% when using PebblesDB (compared to their default storage engines) while write IO is decreased by 35-55%.
database  storage 
october 2017 by mpm
Modern B-Tree Techniques
Invented about 40 years ago and called ubiquitous less than 10 years later, B-tree indexes have been used in a wide variety of computing systems from handheld devices to mainframes and server farms. Over the years, many techniques have been added to the basic design in order to improve efficiency or to add functionality. Examples include
separation of updates to structure or contents, utility operations such as non-logged yet transactional index creation, and robust query processing such as graceful degradation during index-to-index navigation.
database  datastructure  storage 
october 2017 by mpm
The log abstraction carries with it two important promises that are difficult to fulfill at scale: highly available and durable record storage, and a repeatable total order on those records. LogDevice attempts to deliver on those two promises, so dear to the heart of a distributed system designer, at an essentially unlimited scale. It is a distributed data store designed specifically for logs
logging  messaging  storage 
august 2017 by mpm
EC-Cache: Load-balanced, Low-latency Cluster Caching with Online Erasure Coding
EC-Cache is a load-balanced, low latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. EC-Cache employs erasure coding by: (i) splitting and erasure coding individual objects during writes, and (ii) late binding, wherein obtaining any k out of (k+r) splits of an object are sufficient, during reads. As compared to selective replication, EC-Cache improves load balancing by more than 3x and reduces the median and tail read latencies by more than 2x for typical parameters, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.
storage  availability  caching 
august 2017 by mpm
A simple disk-based hash table (i.e., persistent hash table).

It is a hashtable implemented on memory-mapped disk, so that it can be loaded with a single mmap() system call and used in memory directly (being as fast as an in-memory hashtable once it is loaded from disk).
storage  database  memory  datastructure  c++  python 
july 2017 by mpm
Storage Performance Development Kit
The Storage Performance Development Kit (SPDK) provides a set of tools and libraries for writing high performance, scalable, user-mode storage applications. It achieves high performance by moving all of the necessary drivers into userspace and operating in a polled mode instead of relying on interrupts, which avoids kernel context switches and eliminates interrupt handling overhead
storage  performance 
march 2017 by mpm
File-system fault injection framework for distributed storage systems
storage  testing  availability 
march 2017 by mpm
Efficient and Modular Consensus-Free Reconfiguration for Fault-Tolerant Storage
Quorum systems are useful tools for implementing consistent and available storage in the presence of failures. These systems usually comprise a static set of servers that provide a fault-tolerant read/write register accessed by a set of clients. We consider a dynamic variant of these systems and propose FreeStore, a set of fault-tolerant protocols that emulates a register in dynamic asynchronous systems in which processes are able to join/leave the servers set during the execution. These protoco...
consensus  replication  storage  fault-tolerance 
july 2016 by mpm
BTrDB: Optimizing Storage System Design for Timeseries Processing
The increase in high-precision, high-sample-rate telemetry timeseries poses a problem for existing timeseries databases which can neither cope with the throughput demands of these streams nor provide the necessary primitives for effective analysis of them. We present a novel abstraction for telemetry timeseries data and a data structure for providing this abstraction: a time-partitioning version-annotated copy-on-write tree. An implementation in Go is shown to outperform existing solutions, demonstrating a throughput of 53 million inserted values per second and 119 million queried values per second on a four-node cluster. The system achieves a 2.9x compression ratio and satisfies statistical queries spanning a year of data in under 200ms, as demonstrated on a year-long production deployment storing 2.1 trillion data points. The principles and design of this database are generally applicable to a large variety of timeseries types and represent a significant advance in the development of technology for the Internet of Things
database  storage  time 
may 2016 by mpm
SkimpyStash: RAM Space Skimpy Key-Value Store on Flash-based Storage
The distinguishing feature of SkimpyStash is the design goal of extremely low RAM footprint at about 1 (±0.5) byte per key-value pair, which is more aggressive than earlier designs
storage  database 
april 2016 by mpm
Welcome to the open source project for creating new interfaces for non-volatile memory (like flash).
io  storage 
april 2016 by mpm
Our goal is a robust & reliable, distributed, highly available(*), large file store based upon write-once registers, append-only files, Chain Replication, and client-server style architecture
storage  replication 
november 2015 by mpm
ForestDB is a key-value storage engine that is developed by Couchbase Caching and Storage Team, and its main index structure is built from Hierarchical B+-Tree based Trie, called HB+-Trie
c++  storage  database 
november 2015 by mpm
a numeric time-series database
c++  metrics  storage 
october 2015 by mpm
Efficient Tabular Storage
We discuss efficient techniques for on-disk storage of tabular data
august 2015 by mpm
Scaling Concurrent Log-Structured Data Stores
Log-structured data stores (LSM-DSs) are widely accepted as the state-of-the-art implementation of key-value stores. They replace random disk writes with sequential I/O, by accumulating large batches of updates in an in-memory data structure and merging it with the on-disk store in the background. While LSM-DS implementations proved to be highly successful at masking the I/O bottleneck, scaling them up on multicore CPUs remains a challenge. This is nontrivial due to their often rich APIs, as well as the need to coordinate the RAM access with the background I/O.

We present cLSM, an algorithm for scalable concurrency in LSM-DS, which exploits multiprocessor-friendly data structures and non-blocking synchronization. cLSM supports a rich API, including consistent snapshot scans and general non-blocking read-modify-write operations.

We implement cLSM based on the popular LevelDB key-value store, and evaluate it using intensive synthetic workloads as well as ones from production web-serving applications. Our algorithm outperforms state of the art LSM-DS implementations, improving throughput by 1.5x to 2.5x. Moreover, cLSM demonstrates superior scalability with the number of cores (successfully exploiting twice as many cores as the competition)
database  storage  performance 
may 2015 by mpm
Pistachio: co-locate the data and compute for fastest cloud compute
Pistachio is a distributed key value store system. Data can be replicated with n replicas with strong consistency guarantees. Up to (n-1) failures can be tolerated. It’s being used as the user profile storage for large scale ads serving products within Yahoo. 10+ billions of user profiles are being stored with ~2 million reads QPS, 0.8GB/s read throughput and ~0.5 million writes QPS, 0.3GB/s write throughput. Average latency is under 1ms. It guarantees strong consistency and fault-tolerance. We have hundreds of servers in 8 data centers all over the globe supporting hundreds of millions in revenue.
dht  storage  consistency 
may 2015 by mpm
A fast file-based append-only object store, using memory mapped files.
java  storage 
april 2015 by mpm
Algorithms for Replica Placement in High-Availability Storage
A new model of causal failure is presented, and used to solve a novel replica placement problem in data centers. The model describes dependencies among system components as a directed graph. A replica placement is defined as a subset of vertices in such a graph. A criterion for optimizing replica placements is formalized and explained. In this work, the optimization goal is to avoid choosing placements in which a single event is likely to wipe out multiple replicas. Using this criterion, a fast algorithm is given for the scenario in which the dependency model is a tree. The main contribution of the paper is an O(n+ρlogρ) dynamic programming algorithm for placing ρ replicas on a tree with n vertices. This algorithm exhibits the interesting property that only two subproblems need to be recursively considered at each stage. An O(n2ρ) greedy algorithm is also briefly reported.
availability  storage 
march 2015 by mpm
Sheepdog Project
Sheepdog is a distributed object storage system for volume and container services and manages the disks and nodes intelligently. Sheepdog features ease of use, simplicity of code and can scale out to thousands of nodes
march 2015 by mpm
Akumuli is embedded time-series database, without dependency on third-party software or services, that implements custom storage engine designed specifically for time series data.
time  database  storage 
october 2014 by mpm
Muninn: a Versioning Key-Value Store using Object-based Storage Model
While non-volatile memory (NVRAM) devices have the potential to alleviate the trade-off between performance, scalability, and energy in storage and memory subsystems, a block interface and storage subsystems designed for slow I/O devices make it difficult to efficiently exploit NVRAMs in a portable and extensible way.

We propose an object-based storage model as a way of addressing the shortfalls of the current interfaces. Through the design of Muninn, an object-based versioning key-value store, we demonstrate that an in-device NVRAM management layer can be as efficient as that of NVRAM-aware key-value stores while not requiring host resources or host changes, and enabling tightly-coupled optimizations. As a key-value store, Muninn is designed to achieve better lifetime and low read/write amplification by eliminating internal data movements and per-object metadata updates using Bloom filters and hash functions. By doing so, it achieves as little as 1.5 flash page reads per look up and 0.26 flash page writes per insert on average with 50 million 1 KB key-value pairs without incurring data re-organization. This is close to the minimum number of flash accesses required to read and store the 1 KB key-value pairs, thus increasing performance and lifetime of flash media.
storage  io  performance 
july 2014 by mpm
NVMKV: A Scalable and Lightweight Flash Aware Key-Value Store
State-of-the-art flash-optimized KV stores frequently rely upon a log structure and/or compaction-based strategy to optimally organize content on flash. However, these strategies lead to excessive I/O, beyond the write amplification generated within the flash itself, with both the application and the flash device constantly rearranging data. In this paper, we explore the other extreme in the design space: minimal data management at the KV store and heavy reliance on the Flash Translation Layer (FTL) capabilities. NVMKV is a scalable and lightweight KV store that leverages advanced capabilities that are becoming available in modern FTLs. We demonstrate that NVMKV (i) performs KV operations at close to native device access speeds for get operations, (ii) outperforms state of the art KV stores by 50%-300%, (iii) significantly improves performance predictability for the YCSB KV benchmark when compared with the popular LevelDB KV store, and (iv) reduces data written to flash by as much as 1.7X and 29X for sequential and random write workloads relative to LevelDB, thereby dramatically increasing device lifetime.
storage  performance 
june 2014 by mpm
Linux Asynchronous I/O Explained
Asynchronoes I/O (AIO) is a method for performing I/O operations so that the process that issued an I/O request is not blocked till the data is available. Instead, after an I/O request is submitted, the process continues to execute its code and can later check the status of the submitted request. Linux kernel provides only *5* system calls for performing asynchronoes I/O. Other AIO functions commonly descibed in the literature are implemented in the user space libraries and use the system calls internally. Some libraries can also emulate AIO functionality entirely in the user space without any kernel support. There are two main libraries in Linux that facilitate AIO, we will refer to them as *libaio* and *librt* (the latter one is a part of libc). In this text, I first discuss system calls, then libaio, and finaly librt.
linux  storage 
april 2014 by mpm
The Bw-Tree: A B-tree for New Hardware Platforms
The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B-tree, called the Bw-tree achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage.
datastructure  memory  storage 
april 2014 by mpm
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere at any time and only pay for what they use and store. In WAS, data is stored durably using both local and geographic replication to facilitate disaster recovery. Currently, WAS storage comes in the form of Blobs (files), Tables (structured storage), and Queues (message delivery). In this paper, we describe the WAS architecture, global namespace, and data model, as well as its resource provisioning, load balancing, and replication systems.
consistency  availability  storage 
march 2014 by mpm
Coding for SSDs
I want to make solid-state drives (SSDs) the optimal storage solution for my key-value store project. For that reason, I had to make sure I fully understood how SSDs work, so that I can optimize my hash table implementation to suit their internal characteristics. There is a lot of incomplete and contradictory information out there, and finding trustworthy information about SSDs was not an easy task. I had to do some substantial reading to find the proper publications and benchmarks in order to convince myself that, if I had to be coding for SSDs, I would know what I was doing.
storage  performance 
february 2014 by mpm
Reflections on Operating System Support for Database Management
The abstractions provided by operating systems can hinder the development of efficient databases. OS abstractions are essential for building user applications. It would be highly inefficient, in terms of development time, if every application required the implementation of its own kernel. The OS provides a foundation from which all applications can build upon. However, this general framework does not come free. Performance and the ability to utilize the full capacity of the hardware is traded for a general platform. Like all things in life, there is a tension between two opposing forces. In 1981, Michael Stonebraker wrote about the tension between operating systems and databases. These are my thoughts.
storage  database 
february 2014 by mpm
Disks from the Perspective of a File System
This article examines the shortcuts that disks take and the hoops that file systems must jump through to get the desired reliability.
filesystem  storage 
february 2014 by mpm
The Design and Implementation of Modern Column-Oriented Database Systems
In this article, we survey recent research on column-oriented database systems, or column-stores, where each attribute of a table is stored in a separate file or region on storage. Such databases have seen a resurgence in recent years with a rise in interest in analytic queries that perform scans and aggregates over large portions of a few columns of a table. The main advantage of a column-store is that it can access just the columns needed to answer such queries. We specifically focus on three influential research prototypes, MonetDB [46], VectorWise [18],and C-Store [88]. These systems have formed the basis for several well-known commercial column-store implementations. We describe their similarities and differences and discuss their specific architectural features for compression, late materialization, join processing, vectorization and adaptive indexing (database cracking).
storage  database 
december 2013 by mpm
Aether: A Scalable Approach to Logging
The shift to multi-core hardware brings new challenges to database systems, as the software parallelism determines performance. Even though database systems traditionally accommodate simultaneous requests, a multitude of synchronization barriers serialize execution. Write-ahead logging is a fundamental, omnipresent component in ARIES-style concurrency and recovery, and one of the most important yet-to-be addressed potential bottlenecks, especially in OLTP workloads making frequent small changes to data. In this paper, we identify four logging-related impediments to database system scalability. Each issue challenges different level in the software architecture: (a) the high volume of small-sized I/O requests may saturate the disk, (b) transactions hold locks while waiting for the log flush, (c) extensive context switching overwhelms the OS scheduler with threads executing log I/Os, and (d) contention appears as transactions serialize accesses to in-memory log data structures. We demonstrate these problems and address them with techniques that, when combined, comprise a holistic, scalable approach to logging. Our solution achieves a 20%-69% speedup over a modern database system when running log-intensive workloads, such as the TPC-B and TATP benchmarks. Moreover, it achieves log insert throughput over 1.8GB/s for small log records on a single socket server, an order of magnitude higher than the traditional way of accessing the log using a single mutex
storage  database 
november 2013 by mpm
A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster
Erasure codes, such as Reed-Solomon (RS) codes, are being increasingly employed in data centers to combat the cost of reliably storing large amounts of data. Although these codes provide optimal storage efficiency, they require significantly high network and disk usage during recovery of missing data. In this paper, we first present a study on the impact of recovery operations of erasure-coded data on the data-center network, based on measurements from Facebook's warehouse cluster in production. To the best of our knowledge, this is the first study of its kind available in the literature. Our study reveals that recovery of RS-coded data results in a significant increase in network traffic, more than a hundred terabytes per day, in a cluster storing multiple petabytes of RS-coded data. To address this issue, we present a new storage code using our recently proposed "Piggybacking" framework, that reduces the network and disk usage during recovery by 30% in theory, while also being storage optimal and supporting arbitrary design parameters. The implementation of the proposed code in the Hadoop Distributed File System (HDFS) is underway. We use the measurements from the warehouse cluster to show that the proposed code would lead to a reduction of close to fifty terabytes of cross-rack traffic per day
storage  reliability 
october 2013 by mpm
sophia - an embeddable key-value database designed for a highload
Sophia is a modern embeddable key-value database designed for a high load environment
database  storage 
september 2013 by mpm
Erasure Coding For Distributed Storag Wiki
This wiki is meant to provide a service to researchers by summarizing the main developments in problems relating to coding and network coding for distributed storage. Our hope is that this site can serve as an online paper repository and resource for researchers in the field.
storage  erasure-coding 
august 2013 by mpm
Symas Lightning MDB
LMDB is an ultra-fast, ultra-compact key-value data store developed by Symas for the OpenLDAP Project. It uses memory-mapped files, so it has the read performance of a pure in-memory database while still offering the persistence of standard disk-based databases, and is only limited to the size of the virtual address space
database  storage 
august 2013 by mpm
XORing Elephants: Novel Erasure Codes for Big Data
Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability. This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance. We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2x on the repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information theoretically optimal to obtain locality. Because the new codes repair failures faster, this provides higher reliability, which is orders of magnitude higher compared to replication
storage  data  performance  availability 
july 2013 by mpm
Inside HyperLevelDB
While stock LevelDB is an excellent foundation, our experience with HyperDex identified several opportunities for further performance improvements. This article describes the changes we've made to LevelDB meet HyperDex clients' demanding needs
database  io  concurrency  storage  non-blocking 
june 2013 by mpm
Aries: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging
In this paper we present a simple and efficient method, called ARIES ( Algorithm for Recouery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, finegranularity (e.g., record) locking and recovery using write-ahead logging (WAL). We introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logging is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks
storage  database 
may 2013 by mpm
An efficient multi-tier tablet server storage architecture
Distributed, structured data stores such as Big Table, HBase, and Cassandra use a cluster of machines, each running a database-like software system called the Tablet Server Storage Layer or TSSL. A TSSL's performance on each node directly impacts the performance of the entire cluster. In this paper we introduce an efficient, scalable, multi-tier storage architecture for tablet servers. Our system can use any layered mix of storage devices such as Flash SSDs and magnetic disks. Our experiments show that by using a mix of technologies, performance for certain workloads can be improved beyond configurations using strictly two-tier approaches with one type of storage technology. We utilized, adapted, and integrated cache-oblivious algorithms and data structures, as well as Bloom filters, to improve scalability significantly. We also support versatile, efficient transactional semantics. We analyzed and evaluated our system against the storage layers of Cassandra and Hadoop HBase. We used wide range of workloads and configurations from read- to write-optimized, as well as different input sizes. We found that our system is 3--10× faster than existing systems; that using proper data structures, algorithms, and techniques is critical for scalability, especially on modern Flash SSDs; and that one can fully support versatile transactions without sacrificing performance
storage  database 
april 2013 by mpm
The Bw-Tree: A B-tree for New Hardware
The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B tree, called the Bw-tree achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance
database  storage 
april 2013 by mpm
This project is a building block that implements a simple single-node store system to use as a building block for distributed storage systems
java  storage 
march 2013 by mpm
A realistic distributed storage system: the rack model
In a realistic distributed storage environment, storage nodes are usually placed in racks, a metallic support designed to accommodate electronic equipment. It is known that the communication (bandwidth) cost between nodes which are in the same rack is much lower than between nodes which are in different racks. In this paper, a new model, where the storage nodes are placed in two racks, is proposed and analyzed. Moreover, the two-rack model is generalized to any number of racks. In this model, the storage nodes have different repair costs depending on the rack where they are placed. A threshold function, which minimizes the amount of stored data per node and the bandwidth needed to regenerate a failed node, is shown. This threshold function generalizes the ones given for previous distributed storage models. The tradeoff curve obtained from this threshold function is compared with the ones obtained from the previous models, and it is shown that this new model outperforms the previous ones in terms of repair cost
march 2013 by mpm
MapDB provides concurrent TreeMap and HashMap backed by disk storage.
java  storage  database 
november 2012 by mpm
Stasis is a flexible transactional storage library that is geared toward high-performance applications and system developers. It supports concurrent transactional storage, and no-FORCE/STEAL buffer management
storage  concurrency 
november 2012 by mpm
SILT: A Memory-Efficient, High-Performance Key-Value Store
SILT (Small Index Large Table) is a memory-efficient, high performance key-value store system based on flash storage that scales to serve billions of key-value items on a single node. It requires only 0.7 bytes of DRAM per entry and retrieves key/value pairs using on average 1.01 flash reads each.
october 2012 by mpm
Perses: Data Layout for Low Impact Failures
PERSES reduces the length of degradation from the reference frame of the user by clustering data on disks such that working sets are kept together as much as possible. During a device failure, this co-location reduces the number of impacted working sets. PERSES uses statistical properties of data accesses to automatically determine which data to co-locate, avoiding extra administrative overhead
data  storage  availability  fault-removal  mttr 
october 2012 by mpm
Disks from the Perspective of a File System
File systems need to be aware of the disk technology on which they are running to ensure that they can reliably deliver the semantics that they have promised
september 2012 by mpm
The GetData Project is the reference implementation of the Dirfile Standards, a filesystem-based, column-oriented database format for time-ordered binary data. The Dirfile database format is designed to provide a fast, simple format for storing and reading data.
database  storage 
august 2012 by mpm
Akiban Persistit
Akiban PersistitTM is an open source transactional Java B+ tree library
storage  database 
august 2012 by mpm
SSTable and Log Structured Storage: LevelDB
SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads
database  datastructure  io  storage 
july 2012 by mpm
RealTime Data Compression: LZ4
LZ4 is a very fast compressor, based on well-known LZ77 (Lempel-Ziv) algorithm. Originally a fork from LZP2, it provides better compression ratio for text files and reaches impressive decompression speed, in the range and beyond 1GB/s per core (!), especially for binary files. These speeds are scalable with multi-threading modes, quickly reaching RAM speed limits on multi-core systems
storage  performance 
june 2012 by mpm
SQLCipher is an open source extension to SQLite that provides transparent 256-bit AES encryption of database files
database  confidentiality  storage 
april 2012 by mpm
Boxwood: Abstractions as the Foundation for Storage Infrastructure
We have built a system called Boxwood to explore the feasibility and utility of providing high-level abstractions or data structures as the fundamental storage infrastructure
distributed  consensus  datastructure  storage  fault-tolerance 
march 2012 by mpm
Efficient Replica Maintenance for Distributed Storage Systems
This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially lead to copying a server's worth of data over the Internet to maintain replication levels.
storage  availability  fault-tolerance 
march 2012 by mpm
Project CCNx® exists to develop, promote, and evaluate a new approach to communication architecture we call content-centric networking.
networking  ccn  storage 
march 2012 by mpm
This erlang-based storage engine implements a structure somewhat like LSM-trees (Log-Structured Merge Trees)
erlang  storage  datastructure 
january 2012 by mpm
Log structured storage
It originated as Log Structured File Systems in the 1980s, but more recently it's seeing increasing use as a way to structure storage in database engines
algorithm  storage  filesystem 
january 2012 by mpm
This is a rewrite (port) of LevelDB in Java. This goal is to have a feature complete implementation that is within 10% of the performance of the C++ original and produces byte-for-byte exact copies of the C++ code.
java  storage  database 
november 2011 by mpm
SSD Reading List
Flash-based Database Reading List
october 2011 by mpm
A fast key-value Database Storage Engine,Distributable B+Tree Structured Indexes , Level-LRU,and support Range-Query.
storage  database 
october 2011 by mpm
HawtJournal is a journal storage implementation based on append-only rotating logs and checksummed variable-length records, with fixed concurrent reads, full concurrent writes, dynamic batching and "dead" logs cleanup.
java  storage 
september 2011 by mpm
Disk-Locality in Datacenter Computing Considered Irrelevant
This paper takes the position that disk-locality is going to be irrelevant in cluster computing, and considers the implications this will have on datacenter computing research.
storage  networking 
july 2011 by mpm
Erlang LevelDB Driver
erlang  database  storage 
may 2011 by mpm
Erlang bindings to Google's Snappy compression library
erlang  storage 
may 2011 by mpm
An ACID compliant key-value storage library in the style of dbm. It utilizes immutable B+Trees in an append-only design and is distributed under LGPL
database  storage 
april 2011 by mpm
ddar is a free de-duplicating archiver for Unix
storage  home-it 
april 2011 by mpm
Whisper is a fixed-size database, similar in design to RRD (round-robin-database). It provides fast, reliable storage of numeric data over time
database  metrics  storage 
april 2011 by mpm
It is a durable, consistent, high-performance key-value data store built out of a background-garbage-collected log-structured B+Tree
storage  datastructure 
march 2011 by mpm
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression.
march 2011 by mpm
« earlier      
per page:    204080120160

Copy this bookmark: