mpm + database   149

Debezium
Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.
database  event 
7 weeks ago by mpm
GNU Recutils
Recutils is a collection of tools, like recins, recdel, and recsel, used to manage recfiles (plain-text record databases). They allow for all the normal basic relational database operations, typing, auto-incrementing, and even field-level crypto. All of this power is yours, with the bonus that your database is a human-readable text file that you can grep/awk/sed freely, and the line-oriented structure makes it perfect for version control systems.
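The recfile format behind these tools is simple enough to make concrete: records are blank-line-separated groups of "Field: value" lines. A minimal, illustrative parser in Python (real recutils also handles %rec descriptors, typed fields, and multi-line values; parse_recfile is a name invented here):

```python
def parse_recfile(text):
    """Parse a minimal recfile: records are blank-line-separated groups
    of 'Field: value' lines. (Real recutils adds %rec descriptors,
    field types, and multi-line values.)"""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:                 # blank line closes a record
                records.append(current)
                current = {}
        elif ":" in line and not line.startswith("#"):
            field, _, value = line.partition(":")
            current.setdefault(field.strip(), value.strip())
    if current:
        records.append(current)
    return records

books = parse_recfile("""
Title: SICP
Author: Abelson

Title: TAPL
Author: Pierce
""")
# Select records the way `recsel -e` would, but in Python:
tapl = [r for r in books if r["Title"] == "TAPL"]
```

Because each field sits on its own line, ordinary line-oriented tools (grep, diff, version control) work on the database directly, which is the point the entry makes.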
database  linux 
9 weeks ago by mpm
Message DB
A fully-featured event store and message store implemented in PostgreSQL for Pub/Sub, Event Sourcing, Messaging, and Evented Microservices applications.
postgresql  event  messaging  database 
january 2020 by mpm
The Concurrent 2-Trie
The concurrent 2-trie is a dictionary-like data structure that is designed and optimized specifically to be used for translation tables in file buffer pools. When the concurrent 2-trie replaced the lock-striped hop-scotch hash tables in the Neo4j file buffer pool, file buffer accesses became 30% faster.

The concurrent 2-trie optimizes for the domain of file buffer translation tables by making the following observations: First, the file page identifiers form a dense sequence, from zero up to the last page in the file. Second, files can only grow by being extended at the end. Third, most database deployments are able to fit the majority of their data in memory, and often the data set fits entirely in memory. This means the translation tables are usually densely packed.
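The addressing scheme those observations enable can be sketched: split the dense page id into a chunk index and an offset, and grow the structure only by appending leaf chunks. This single-threaded sketch shows only the two-level addressing; the real concurrent 2-trie installs chunks with CAS and never relocates entries, so readers need no locks. All names here are invented for illustration:

```python
CHUNK_BITS = 8             # leaf holds 256 entries; the real structure tunes this
CHUNK_SIZE = 1 << CHUNK_BITS
MASK = CHUNK_SIZE - 1

class TwoTrie:
    """Sketch of a 2-level trie mapping dense file-page ids to buffer
    slots: high bits pick a leaf chunk, low bits index into it."""
    def __init__(self):
        self.root = []     # list of fixed-size leaf chunks

    def put(self, page_id, slot):
        hi, lo = page_id >> CHUNK_BITS, page_id & MASK
        while len(self.root) <= hi:      # files only grow at the end,
            self.root.append([None] * CHUNK_SIZE)  # so chunks only append
        self.root[hi][lo] = slot

    def get(self, page_id):
        hi, lo = page_id >> CHUNK_BITS, page_id & MASK
        if hi >= len(self.root):
            return None
        return self.root[hi][lo]

t = TwoTrie()
t.put(0, "slot-a")
t.put(300, "slot-b")       # lands in the second leaf chunk (300 >> 8 == 1)
```

Because ids are dense, every slot in every allocated chunk is likely used, which is why this beats a general-purpose hash table for this workload.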
datastructure  memory  database 
january 2020 by mpm
Benefits of Stabilization versus Rollback in Eventually Consistent Key-Value Stores
In this paper, we evaluate and compare the performance of two approaches, namely self-stabilization and rollback, to handling consistency violation faults (cvf) that occur when a distributed program is executed on an eventually consistent key-value store. We observe that self-stabilization is usually better than rollback in our experiments. Moreover, when we aggressively allow more cvf in exchange for eliminating mechanisms that guarantee the atomicity requirements of actions, we observe that the programs in our case studies achieve a speedup of 2--15 times compared with the standard implementation. We also analyze the different factors that contribute to these results. Our results and analysis are useful in helping a system designer choose the proper design options for their program.
self-stabilization  consistency  database 
october 2019 by mpm
Atomic Durability - How do databases recover from a crash?
In this post we're going to talk about the four different ways of having durable transactions.
database  consistency  safety  storage 
october 2019 by mpm
Designing an Efficient Replicated Log Store with Consensus Protocol
A highly available, high-performance message logging system is a critical building block for various use cases that require global ordering, especially for deterministic distributed transactions. To achieve availability, we maintain multiple replicas that have the same payloads in exactly the same order. This introduces various challenging issues, such as consistency between replicas after a failure, while minimizing performance degradation. Replicated state machine-based consensus protocols are the most suitable candidates to fulfill those requirements, but the double-write problem and differing logging granularity make it hard to keep the system efficient. This paper suggests a novel way to build a replicated log store on top of the Raft consensus protocol, aiming to provide the same level of consistency as well as fault-tolerance without sacrificing the throughput of the system.
consensus  consistency  replication  storage  database 
july 2019 by mpm
The Berkeley Tree DataBase provides very fast storage of scalar-valued timeseries data
database  time-series 
june 2019 by mpm
Distributed Transactional Systems Cannot Be Fast
We prove that no fully transactional system can provide fast read transactions (including read-only ones, which are considered the most frequent in practice). Specifically, to achieve fast read transactions, the system has to give up support for transactions that write more than one object. We prove this impossibility result for distributed storage systems that are causally consistent, i.e., they are not required to ensure any strong form of consistency. Therefore, our result also holds for any system that ensures a consistency level stronger than causal consistency, e.g., strict serializability. The impossibility result holds even for systems that store only two objects (and have at least two servers and at least four clients). It also holds for systems that are partially replicated. Our result justifies the design choices of state-of-the-art distributed transactional systems and suggests that system designers should not put more effort into designing fully-functional systems that support both fast read transactions and causal or any stronger form of consistency.
performance  database  consistency 
june 2019 by mpm
It’s Time to Move on from Two Phase Commit
In my opinion we need to remove veto power from workers and architect systems in which the system does not have freedom to abort a transaction
database  protocol  consistency 
may 2019 by mpm
Beam is a distributed knowledge graph store, sometimes called an RDF store or a triple store
rdf  database 
may 2019 by mpm
SSS: Scalable Key-Value Store with External Consistent and Abort-free Read-only Transactions
We present SSS, a scalable transactional key-value store deploying a novel distributed concurrency control that provides external consistency for all transactions, never aborts read-only transactions due to concurrency, all without specialized hardware. SSS ensures the above properties without any centralized source of synchronization. SSS's concurrency control uses a combination of vector clocks and a new technique, called snapshot-queuing, to establish a single transaction serialization order that matches the order of transaction completion observed by clients. We compare SSS against high performance key-value stores, Walter, ROCOCO, and a two-phase commit baseline. SSS outperforms 2PC-baseline by as much as 7x using 20 nodes; and ROCOCO by as much as 2.2x with long read-only transactions using 15 nodes.
storage  database 
january 2019 by mpm
The FuzzyLog: A Partially Ordered Shared Log
The FuzzyLog is a partially ordered shared log abstraction. Distributed applications can concurrently append to the partial order and play it back. FuzzyLog applications obtain the benefits of an underlying shared log – extracting strong consistency, durability, and failure atomicity in simple ways – without suffering from its drawbacks. By exposing a partial order, the FuzzyLog enables three key capabilities for applications: linear scaling for throughput and capacity (without sacrificing atomicity), weaker consistency guarantees, and tolerance to network partitions. We present Dapple, a distributed implementation of the FuzzyLog abstraction that stores the partial order compactly and supports efficient appends/playback via a new ordering protocol. We implement several data structures and applications over the FuzzyLog, including several map variants as well as a ZooKeeper implementation. Our evaluation shows that these applications are compact, fast, and flexible: they retain the simplicity (100s of lines of code) and strong semantics (durability and failure atomicity) of a shared log design while exploiting the partial order of the FuzzyLog for linear scalability, flexible consistency guarantees (e.g., causal+ consistency), and network partition tolerance. On a 6-node Dapple deployment, our FuzzyLog-based ZooKeeper supports 3M/sec single-key writes, and 150K/sec atomic cross-shard renames.
database  consistency  storage 
december 2018 by mpm
Keep The Data Where You Use It
The de facto method of keeping the data close to the users is full replication. Many fully replicated systems, however, still have a single region responsible for orchestrating the writes, making the data available locally only for reads, not for updates.
data  database  replication 
december 2018 by mpm
The FuzzyLog: Partially Ordered Shared Log
The FuzzyLog is a partially ordered shared log. Unlike traditional SMR systems, such as Paxos or Tango, which store all events in a single total order, the FuzzyLog allows the storage and update of partially ordered histories. This relaxation of ordering constraints enables richer application semantics around consistency guarantees, data partitioning and log-playback, while retaining the ease-of-programming of the shared-log model.
consistency  storage  database  datastructure 
august 2018 by mpm
MySQL 8.0: New Lock free, scalable WAL design
The Write Ahead Log (WAL) is one of the most important components of a database. All changes to data files are logged in the WAL (called the redo log in InnoDB). This makes it possible to postpone the moment when modified pages are flushed to disk, while still protecting against data loss.
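The ordering rule the entry describes (log the change first, dirty the page second, flush the page whenever convenient) can be shown in a few lines. This is a toy sketch with invented names, not InnoDB's actual structures:

```python
class RedoWAL:
    """Minimal sketch of write-ahead logging: every change is appended
    to a redo log *before* the in-memory page is modified, so flushing
    the dirty page to disk can be postponed arbitrarily."""
    def __init__(self):
        self.log = []          # durable redo records, in LSN order
        self.buffer_pool = {}  # dirty in-memory pages (lost on crash)
        self.disk = {}         # on-disk page images

    def modify(self, page, value):
        lsn = len(self.log)
        self.log.append((lsn, page, value))   # 1. log the change first
        self.buffer_pool[page] = value        # 2. only then dirty the page

    def crash_and_recover(self):
        self.buffer_pool.clear()              # crash: the pool is gone
        for _, page, value in self.log:       # redo: replay the log
            self.disk[page] = value
        return self.disk

wal = RedoWAL()
wal.modify("p1", "A")
wal.modify("p1", "B")
wal.modify("p2", "C")
# No page was ever flushed, yet nothing is lost after a crash:
state = wal.crash_and_recover()
```

The lock-free part of the MySQL 8.0 design is about letting many threads reserve log space concurrently; the invariant above is what they all preserve.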
database  concurrency  storage  non-blocking 
july 2018 by mpm
UnQLite is a in-process software library which implements a self-contained, serverless, zero-configuration, transactional NoSQL database engine. UnQLite is a document store database similar to MongoDB, Redis, CouchDB etc. as well a standard Key/Value store similar to BerkeleyDB, LevelDB, etc.
storage  database 
january 2018 by mpm
Main Memory Database Systems
Below are two resources that describe the landscape of modern main-memory database systems. The first is a survey/book from Foundations and Trends in Databases, and the second is a slide deck from a VLDB 2016 tutorial. The slides roughly match the content found in the survey. Feel free to contact me with any comments/errors/questions.
database  memory 
january 2018 by mpm
A Simple and Efficient Implementation for Small Databases
This paper describes a technique for implementing the sort of small databases that frequently occur in the design of operating systems and distributed systems. We take advantage of the existence of very large virtual memories, and quite large real memories, to make the technique feasible. We maintain the database as a strongly typed data structure in virtual memory, record updates incrementally on disk in a log and occasionally make a checkpoint of the entire database. We recover from crashes by restoring the database from an old checkpoint then replaying the log. We use existing packages to convert between strongly typed data objects and their disk representations, and to communicate strongly typed data across the network (using remote procedure calls). Our memory is managed entirely by a general purpose allocator and garbage collector. This scheme has been used to implement a name server for a distributed system. The resulting implementation has the desirable property of being simultaneously simple, efficient and reliable.
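The checkpoint-plus-log recipe in this abstract is small enough to sketch end to end: keep the data in memory, append each update to a log, occasionally snapshot the whole state, and recover as checkpoint + replay. A toy sketch under those assumptions (TinyDB and its file names are invented here; the paper uses strongly typed structures, not JSON):

```python
import json, os, tempfile

class TinyDB:
    """Sketch of the paper's scheme: in-memory database, incremental
    log of updates on disk, occasional whole-database checkpoint,
    recovery by restoring the checkpoint and replaying the log."""
    def __init__(self, dirpath):
        self.ckpt = os.path.join(dirpath, "checkpoint.json")
        self.logf = os.path.join(dirpath, "log.jsonl")
        self.data = {}
        if os.path.exists(self.ckpt):
            with open(self.ckpt) as f:        # restore old checkpoint
                self.data = json.load(f)
        if os.path.exists(self.logf):
            with open(self.logf) as f:        # replay updates since then
                for line in f:
                    k, v = json.loads(line)
                    self.data[k] = v

    def put(self, key, value):
        with open(self.logf, "a") as f:       # record the update in the log
            f.write(json.dumps([key, value]) + "\n")
        self.data[key] = value

    def checkpoint(self):
        with open(self.ckpt, "w") as f:       # snapshot the entire database
            json.dump(self.data, f)
        open(self.logf, "w").close()          # log can now be truncated

d = tempfile.mkdtemp()
db = TinyDB(d)
db.put("name", "alpha")
db.checkpoint()
db.put("name", "beta")        # lives only in the log, not the checkpoint
recovered = TinyDB(d)         # simulates a restart after a crash
```

Replay is idempotent here because updates are last-writer-wins assignments, which is what makes the scheme "simultaneously simple, efficient and reliable".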
october 2017 by mpm
PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
Key-value stores such as LevelDB and RocksDB offer excellent write throughput, but suffer high write amplification. The write amplification problem is due to the Log-Structured Merge Trees data structure that underlies these key-value stores. To remedy this problem, this paper presents a novel data structure that is inspired by Skip Lists, termed Fragmented Log-Structured Merge Trees (FLSM). FLSM introduces the notion of guards to organize logs, and avoids rewriting data in the same level. We build PebblesDB, a high-performance key-value store, by modifying HyperLevelDB to use the FLSM data structure. We evaluate PebblesDB using micro-benchmarks and show that for write-intensive workloads, PebblesDB reduces write amplification by 2.4-3× compared to RocksDB, while increasing write throughput by 6.7×. We modify two widely-used NoSQL stores, MongoDB and HyperDex, to use PebblesDB as their underlying storage engine. Evaluating these applications using the YCSB benchmark shows that throughput is increased by 18-105% when using PebblesDB (compared to their default storage engines) while write IO is decreased by 35-55%.
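The guard idea can be sketched: a level is partitioned by guard keys, and incoming sorted runs are sliced per guard and appended as fragments, rather than merged (rewritten) with data already in the level. This toy sketch shows only that mechanism; guard selection and compaction into the next level, which FLSM also defines, are elided, and all names are invented here:

```python
import bisect

class GuardedLevel:
    """Sketch of one FLSM level: guard keys partition the key space,
    and each guard accumulates a stack of sorted fragments that are
    never rewritten within the level (that is what cuts write
    amplification)."""
    def __init__(self, guards):
        self.guards = sorted(guards)
        self.fragments = [[] for _ in range(len(self.guards) + 1)]

    def _slot(self, key):
        return bisect.bisect_right(self.guards, key)

    def append_run(self, run):
        """Slice a sorted run of (key, value) pairs across guards;
        existing fragments are left untouched."""
        for g in range(len(self.fragments)):
            frag = [(k, v) for k, v in run if self._slot(k) == g]
            if frag:
                self.fragments[g].append(frag)

    def get(self, key):
        # Fragments within a guard may overlap, so search newest-first;
        # this is the read cost FLSM trades for cheaper writes.
        for frag in reversed(self.fragments[self._slot(key)]):
            for k, v in frag:
                if k == key:
                    return v
        return None

lvl = GuardedLevel(guards=[10, 20])
lvl.append_run([(3, "a"), (15, "b")])
lvl.append_run([(15, "B"), (25, "c")])   # newer value for key 15
```

A classic LSM level would merge the two runs into one sorted file, rewriting key 3; here only new fragments are written.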
database  storage 
october 2017 by mpm
Modern B-Tree Techniques
Invented about 40 years ago and called ubiquitous less than 10 years later, B-tree indexes have been used in a wide variety of computing systems from handheld devices to mainframes and server farms. Over the years, many techniques have been added to the basic design in order to improve efficiency or to add functionality. Examples include separation of updates to structure or contents, utility operations such as non-logged yet transactional index creation, and robust query processing such as graceful degradation during index-to-index navigation.
database  datastructure  storage 
october 2017 by mpm
DottedDB: Anti-Entropy without Merkle Trees, Deletes without Tombstones
To achieve high availability in the face of network partitions, many distributed databases adopt eventual consistency, allow temporary conflicts due to concurrent writes, and use some form of per-key logical clock to detect and resolve such conflicts. Furthermore, nodes synchronize periodically to ensure replica convergence in a process called anti-entropy, normally using Merkle Trees. We present the design of DottedDB, a Dynamo-like key-value store, which uses a novel node-wide logical clock framework, overcoming three fundamental limitations of the state of the art: (1) minimize the metadata per key necessary to track causality, avoiding its growth even in the face of node churn; (2) correctly and durably delete keys, with no need for tombstones; (3) offer a lightweight anti-entropy mechanism to converge replicated data, avoiding the need for Merkle Trees. We evaluate DottedDB against MerkleDB, an otherwise identical database, but using per-key logical clocks and Merkle Trees for anti-entropy, to precisely measure the impact of the novel approach. Results show that: causality metadata per object always converges rapidly to only one id-counter pair; distributed deletes are correctly achieved without global coordination and with constant metadata; divergent nodes are synchronized faster, with less memory footprint and with less communication overhead than using Merkle Trees.
database  coordination  consistency 
august 2017 by mpm
What is SKIP LOCKED for in PostgreSQL 9.5?
The main utility of SKIP LOCKED is for building simple, reliable and efficient concurrent work queues
messaging  concurrency  database 
july 2017 by mpm
A simple disk-based hash table (i.e., persistent hash table).

It is a hash table implemented on memory-mapped disk, so that it can be loaded with a single mmap() system call and used in memory directly (being as fast as an in-memory hash table once it is loaded from disk).
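The "one mmap() and go" property comes from laying the table out as fixed-size slots in a single file, so the on-disk bytes are the in-memory structure. A toy sketch in Python under those assumptions (open addressing, fixed-width keys and values, no growth or deletion; MmapHash is a name invented here, not the bookmarked library's API):

```python
import mmap, os, struct, tempfile, zlib

SLOTS, KEY, VAL = 64, 16, 16
SLOT = struct.Struct(f"{KEY}s{VAL}s")   # fixed-width key/value slot

class MmapHash:
    """Sketch of a disk-backed hash table: the whole table is one
    mmap'd file of fixed-size slots probed by open addressing, so
    "loading" it is a single mmap() call."""
    def __init__(self, path):
        size = SLOTS * SLOT.size
        with open(path, "a+b") as f:        # create/extend the backing file
            if os.path.getsize(path) < size:
                f.truncate(size)
        self.f = open(path, "r+b")
        self.m = mmap.mmap(self.f.fileno(), size)

    def _probe(self, key):
        i = zlib.crc32(key) % SLOTS         # linear probing from the hash slot
        while True:
            off = i * SLOT.size
            k, v = SLOT.unpack_from(self.m, off)
            yield off, k.rstrip(b"\0"), v.rstrip(b"\0")
            i = (i + 1) % SLOTS

    def put(self, key, val):
        for off, k, _ in self._probe(key):
            if k in (b"", key):             # empty slot or same key: write here
                SLOT.pack_into(self.m, off, key, val)
                return

    def get(self, key):
        for _, k, v in self._probe(key):
            if k == key:
                return v
            if k == b"":                    # hit an empty slot: key is absent
                return None

path = os.path.join(tempfile.mkdtemp(), "table.db")
w = MmapHash(path)
w.put(b"answer", b"42")
reopened = MmapHash(path)                   # persisted: one mmap and go
```

A real implementation adds resizing, deletion, and variable-length entries, but the lookup path is the same pointer arithmetic over the mapping.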
storage  database  memory  datastructure  c++  python 
july 2017 by mpm
A prototype of a Dynamo-style distributed key-value database, implementing Server Wide Clocks as the main causality mechanism across the system.
database  consistency 
june 2017 by mpm
LemonGraph is a log-based transactional graph (nodes/edges/properties) database engine that is backed by a single file. The primary use case is to support streaming seed set expansion.
database  graph 
june 2017 by mpm
Comdb2 is a clustered RDBMS built on Optimistic Concurrency Control techniques. It provides multiple isolation levels, including Snapshot and Serializable Isolation. Read/Write transactions run on any node, with the client library transparently negotiating connections to lowest cost (latency) node which is available. The client library provides transparent reconnect.
concurrency  database 
june 2017 by mpm
Spanner vs. Calvin
I found it very difficult to find cases where an ideal implementation of Spanner theoretically outperforms an ideal implementation of Calvin.
consistency  consensus  database  performance  scalability 
april 2017 by mpm
Beringei is a high performance, in-memory storage engine for time series data
time  database 
december 2016 by mpm
Eventually Consistent Transactions
We propose a novel consistency model based on eventually consistent transactions. Unlike serializable transactions, eventually consistent transactions are ordered by two order relations (visibility and arbitration) rather than a single order relation
consistency  database 
july 2016 by mpm
A Fast Lightweight Time-Series Store for IoT Data
In this paper, we present an efficient architecture for time-series data management that provides a high data ingestion rate, while still being sufficiently lightweight that it can be deployed in embedded environments or small virtual machines.
time  database 
july 2016 by mpm
BTrDB: Optimizing Storage System Design for Timeseries Processing
The increase in high-precision, high-sample-rate telemetry timeseries poses a problem for existing timeseries databases which can neither cope with the throughput demands of these streams nor provide the necessary primitives for effective analysis of them. We present a novel abstraction for telemetry timeseries data and a data structure for providing this abstraction: a time-partitioning version-annotated copy-on-write tree. An implementation in Go is shown to outperform existing solutions, demonstrating a throughput of 53 million inserted values per second and 119 million queried values per second on a four-node cluster. The system achieves a 2.9x compression ratio and satisfies statistical queries spanning a year of data in under 200ms, as demonstrated on a year-long production deployment storing 2.1 trillion data points. The principles and design of this database are generally applicable to a large variety of timeseries types and represent a significant advance in the development of technology for the Internet of Things
database  storage  time 
may 2016 by mpm
SkimpyStash: RAM Space Skimpy Key-Value Store on Flash-based Storage
The distinguishing feature of SkimpyStash is the design goal of extremely low RAM footprint at about 1 (±0.5) byte per key-value pair, which is more aggressive than earlier designs
storage  database 
april 2016 by mpm
Readings in Database Systems, 5th Edition
Readings in Database Systems (commonly known as the "Red Book") has offered readers an opinionated take on both classic and cutting-edge research in the field of data management since 1988. Here, we present the Fifth Edition of the Red Book
book  database 
december 2015 by mpm
joedb is a minimalist embedded relational database, where data is manipulated directly in the target programming language, without using SQL. In joedb, the journal of all modifications is stored to disk. This way, the whole data history is remembered, and it is possible to re-create any past state of the database. It is also a way to make the system extremely simple and fast
c++  database 
november 2015 by mpm
An open-source service monitoring system and time series database.
time  database 
november 2015 by mpm
ForestDB is a key-value storage engine developed by the Couchbase Caching and Storage Team; its main index structure is built from a Hierarchical B+-Tree-based Trie, called HB+-Trie
c++  storage  database 
november 2015 by mpm
AtlasDB is a transactional layer on top of a key-value store. When designing a data store to be scalable, transactions are usually the first feature to be cut. However, they are one of the most useful features for developers. AtlasDB allows any key-value store that supports durable writes to have transactions
october 2015 by mpm
Scaling Concurrent Log-Structured Data Stores
Log-structured data stores (LSM-DSs) are widely accepted as the state-of-the-art implementation of key-value stores. They replace random disk writes with sequential I/O, by accumulating large batches of updates in an in-memory data structure and merging it with the on-disk store in the background. While LSM-DS implementations proved to be highly successful at masking the I/O bottleneck, scaling them up on multicore CPUs remains a challenge. This is nontrivial due to their often rich APIs, as well as the need to coordinate the RAM access with the background I/O.

We present cLSM, an algorithm for scalable concurrency in LSM-DS, which exploits multiprocessor-friendly data structures and non-blocking synchronization. cLSM supports a rich API, including consistent snapshot scans and general non-blocking read-modify-write operations.

We implement cLSM based on the popular LevelDB key-value store, and evaluate it using intensive synthetic workloads as well as ones from production web-serving applications. Our algorithm outperforms state of the art LSM-DS implementations, improving throughput by 1.5x to 2.5x. Moreover, cLSM demonstrates superior scalability with the number of cores (successfully exploiting twice as many cores as the competition)
database  storage  performance 
may 2015 by mpm
Weaver is a distributed graph store that provides horizontal scalability, high-performance, and strong consistency
graph  database 
april 2015 by mpm
From ARIES to MARS: Reengineering Transaction Management for Next-Generation, Solid-State Drives
Systems that provide powerful transaction mechanisms often rely on write-ahead logging (WAL) implementations that were designed with slow, disk-based systems in mind. The emerging class of fast, byte-addressable, non-volatile memory (NVM) technologies (e.g., phase change memories, spin-torque MRAMs, and the memristor), however, present performance characteristics very different from both disks and flash-based SSDs. This paper addresses the problem of designing a WAL scheme optimized for these fast NVM-based storage systems. We examine the features that a system like ARIES, a WAL algorithm popular for databases, must provide and separate them from the implementation decisions ARIES makes to optimize for disk-based systems. We design a new NVM-optimized WAL scheme (called MARS) in tandem with a novel SSD multi-part atomic write primitive that combine to provide the same features as ARIES does without any of the disk-centric limitations. The new atomic write primitive makes the log’s contents visible to the application, allowing for a simpler and faster implementation. MARS provides atomicity, durability, and high performance by leveraging the enormous internal bandwidth and high degree of parallelism that advanced SSDs will provide. We have implemented MARS and the novel visible atomic write primitive in a next-generation SSD. This paper demonstrates the overhead of the primitive is minimal compared to normal writes, and our hardware provides large speedups for transactional updates to hash tables, B-trees, and large graphs. MARS outperforms ARIES by up to 3.7× while reducing software complexity.
database  io  performance 
january 2015 by mpm
m-LIGHT: Indexing Multi-Dimensional Data over DHTs
In this paper, we study the problem of indexing multidimensional data in the P2P networks based on distributed hash tables (DHTs). We identify several design issues and propose a novel over-DHT indexing scheme called m-LIGHT. To preserve data locality, m-LIGHT employs a clever naming mechanism that gracefully maps the index tree into the underlying DHT so that it achieves efficient index maintenance and query processing. Moreover, m-LIGHT leverages a new data-aware index splitting strategy to achieve optimal load balance among peer nodes. We conduct an extensive performance evaluation for m-LIGHT. Compared to the state-of-the-art indexing schemes, m-LIGHT substantially saves the index maintenance overhead, achieves a more balanced load distribution, and improves the range query performance in both bandwidth consumption and response latency
dht  database 
january 2015 by mpm
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
Large, production-quality distributed systems still fail periodically, and sometimes do so catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures, with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnosis and reproduction of the production failures.
database  testing  outage 
november 2014 by mpm
Akumuli is an embedded time-series database, with no dependencies on third-party software or services, that implements a custom storage engine designed specifically for time-series data.
time  database  storage 
october 2014 by mpm
Salt: Combining ACID and BASE in a Distributed Database
This paper presents Salt, a distributed database that allows developers to improve the performance and scalability of their ACID applications through the incremental adoption of the BASE approach. Salt’s motivation is rooted in the Pareto principle: for many applications, the transactions that actually test the performance limits of ACID are few. To leverage this insight, Salt introduces BASE transactions, a new abstraction that encapsulates the workflow of performance-critical transactions. BAS...
consistency  database  base  transactions 
october 2014 by mpm
High performance persistent LSM key-value store library for the JVM
java  database 
october 2014 by mpm
A list of papers essential to understanding databases and building new data systems.
september 2014 by mpm
Unlike other key-value databases, hamsterdb knows about the type of the keys and will use that information to optimize storage and algorithms. A database storing integer keys uses a completely different memory layout than variable length binary keys. This memory layout drastically reduces the file size, reduces I/O, increases performance and improves scalability
august 2014 by mpm
Time-Series Database Requirements
Because I have my own ideas about what constitutes a good time-series database, and because a few people have asked me to describe my requirements, I have decided to publish my thoughts here.
time  database  performance 
june 2014 by mpm
Scalable Atomic Visibility with RAMP Transactions
We’ve developed three new algorithms—called Read Atomic Multi-Partition (RAMP) Transactions—for ensuring atomic visibility in partitioned (sharded) databases: either all of a transaction’s updates are observed, or none are.
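The metadata trick behind RAMP can be sketched: every write carries the transaction timestamp and the set of sibling keys written in the same transaction, so a reader can detect that it saw only part of a transaction and fetch the exact missing versions in a second round. This is a simplified, single-process sketch of the RAMP-Fast read logic under invented names, not the paper's full protocol:

```python
class RampStore:
    """Sketch of RAMP-Fast: versions carry (value, sibling keys), and
    reads use that metadata to repair partially visible transactions."""
    def __init__(self):
        self.versions = {}   # key -> {ts: (value, sibling_keys)}
        self.latest = {}     # key -> ts of newest committed version

    def write_tx(self, ts, updates):
        keys = set(updates)
        for k, v in updates.items():    # prepare: install versions by exact ts
            self.versions.setdefault(k, {})[ts] = (v, keys - {k})
        for k in updates:               # commit: advance "latest" pointers
            self.latest[k] = max(self.latest.get(k, -1), ts)

    def read_tx(self, keys):
        # Round 1: fetch the latest visible version of each key.
        fetched = {k: (self.latest[k],) + self.versions[k][self.latest[k]]
                   for k in keys}       # k -> (ts, value, siblings)
        # Sibling metadata tells us the newest tx each key should reflect.
        required = {k: ts for k, (ts, _, _) in fetched.items()}
        for k, (ts, _, sibs) in fetched.items():
            for s in sibs:
                if s in required:
                    required[s] = max(required[s], ts)
        # Round 2: fetch any version we provably missed, by exact ts.
        result = {}
        for k, (ts, value, _) in fetched.items():
            if required[k] > ts:
                value = self.versions[k][required[k]][0]
            result[k] = value
        return result

store = RampStore()
store.write_tx(1, {"x": "x1", "y": "y1"})
# Simulate racing a commit of tx 2: both versions are prepared, but
# only x's "latest" pointer has advanced so far.
store.versions["x"][2] = ("x2", {"y"})
store.versions["y"][2] = ("y2", {"x"})
store.latest["x"] = 2
pair = store.read_tx({"x", "y"})        # sees all of tx 2, not half of it
```

Without the second round the reader would return x from tx 2 and y from tx 1, exactly the fractured read RAMP rules out.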
concurrency  database 
april 2014 by mpm
Copysets and Chainsets: A Better Way to Replicate
The traditional technique for performing such partitioning and replication is to randomly assign data to replicas. Although such random assignment is relatively easy to implement, it suffers from a fatal drawback: as cluster size grows, it becomes almost guaranteed that a failure of a small percentage of the cluster will lead to permanent data loss.
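The copyset alternative is easy to sketch: instead of letting every chunk pick replicas at random (so nearly every r-node subset eventually holds some chunk, and any r simultaneous failures lose data), build a small fixed set of replica groups from a few permutations of the nodes and place each chunk on one whole group. A toy sketch with invented names, loosely following the copyset construction:

```python
import random

def copyset_assign(nodes, r, chunks, permutations=2, seed=0):
    """Sketch of copyset replication: a few random permutations of the
    nodes are chopped into groups of r (the copysets), and each chunk
    is placed on one whole copyset. Data is lost only if every node of
    some copyset fails, and there are few copysets."""
    rng = random.Random(seed)
    copysets = []
    for _ in range(permutations):
        p = list(nodes)
        rng.shuffle(p)
        copysets += [tuple(sorted(p[i:i + r]))
                     for i in range(0, len(p) - r + 1, r)]
    # Each chunk goes to one copyset chosen at random.
    return [copysets[rng.randrange(len(copysets))] for _ in range(chunks)]

placement = copyset_assign(range(9), r=3, chunks=1000)
distinct = set(placement)
# With 9 nodes and 2 permutations there are at most 6 copysets in play,
# whereas random replication of 1000 chunks would touch most of the 84
# possible 3-node subsets.
```

Fewer distinct copysets means a random r-node failure is far less likely to wipe out all replicas of anything, at the cost of less flexibility in load balancing.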
database  replication  availability 
february 2014 by mpm
Reflections on Operating System Support for Database Management
The abstractions provided by operating systems can hinder the development of efficient databases. OS abstractions are essential for building user applications. It would be highly inefficient, in terms of development time, if every application required the implementation of its own kernel. The OS provides a foundation from which all applications can build upon. However, this general framework does not come free. Performance and the ability to utilize the full capacity of the hardware is traded for a general platform. Like all things in life, there is a tension between two opposing forces. In 1981, Michael Stonebraker wrote about the tension between operating systems and databases. These are my thoughts.
storage  database 
february 2014 by mpm
A distributed SQL database with the scalability of a KV store.
february 2014 by mpm
The Design and Implementation of Modern Column-Oriented Database Systems
In this article, we survey recent research on column-oriented database systems, or column-stores, where each attribute of a table is stored in a separate file or region on storage. Such databases have seen a resurgence in recent years with a rise in interest in analytic queries that perform scans and aggregates over large portions of a few columns of a table. The main advantage of a column-store is that it can access just the columns needed to answer such queries. We specifically focus on three influential research prototypes, MonetDB [46], VectorWise [18], and C-Store [88]. These systems have formed the basis for several well-known commercial column-store implementations. We describe their similarities and differences and discuss their specific architectural features for compression, late materialization, join processing, vectorization and adaptive indexing (database cracking).
storage  database 
december 2013 by mpm
This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.
november 2013 by mpm
Aether: A Scalable Approach to Logging
The shift to multi-core hardware brings new challenges to database systems, as the software parallelism determines performance. Even though database systems traditionally accommodate simultaneous requests, a multitude of synchronization barriers serialize execution. Write-ahead logging is a fundamental, omnipresent component in ARIES-style concurrency and recovery, and one of the most important yet-to-be-addressed potential bottlenecks, especially in OLTP workloads making frequent small changes to data. In this paper, we identify four logging-related impediments to database system scalability. Each issue challenges a different level of the software architecture: (a) the high volume of small-sized I/O requests may saturate the disk, (b) transactions hold locks while waiting for the log flush, (c) extensive context switching overwhelms the OS scheduler with threads executing log I/Os, and (d) contention appears as transactions serialize accesses to in-memory log data structures. We demonstrate these problems and address them with techniques that, when combined, comprise a holistic, scalable approach to logging. Our solution achieves a 20%-69% speedup over a modern database system when running log-intensive workloads, such as the TPC-B and TATP benchmarks. Moreover, it achieves log insert throughput over 1.8GB/s for small log records on a single socket server, an order of magnitude higher than the traditional way of accessing the log using a single mutex.
storage  database 
november 2013 by mpm
Optimizing Linux Memory Management for Low-latency / High-throughput Databases
The first part of the document provides the relevant background information: an outline of how GraphDB manages its data, the symptoms of our problem, and how the Linux Virtual Memory Management (VMM) subsystem works. In the second part of the document, we will detail the methodology, observations and conclusions from our experiments in getting to the root cause of the problem. We end with a summary of the lessons we have learned.
database  linux  performance  memory 
october 2013 by mpm
HypergraphDB is a general purpose, open-source data storage mechanism based on a powerful knowledge management formalism known as directed hypergraphs. While a persistent memory model designed mostly for knowledge management, AI and semantic web projects, it can also be used as an embedded object-oriented database for Java projects of all sizes. Or a graph database. Or a (non-SQL) relational database
database  graph 
september 2013 by mpm
Sky Behavioral Database
Sky is an open source database used for flexible, high performance analysis of behavioral data. For certain kinds of data such as clickstream data and log data, it can be several orders of magnitude faster than traditional approaches such as SQL databases or Hadoop.
september 2013 by mpm
sophia - an embeddable key-value database designed for a highload
Sophia is a modern embeddable key-value database designed for a high load environment
database  storage 
september 2013 by mpm
Symas Lightning MDB
LMDB is an ultra-fast, ultra-compact key-value data store developed by Symas for the OpenLDAP Project. It uses memory-mapped files, so it has the read performance of a pure in-memory database while still offering the persistence of standard disk-based databases, and is limited only by the size of the virtual address space.
database  storage 
august 2013 by mpm
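The memory-mapped read path LMDB relies on can be illustrated with the standard library: once a file is mapped, a read is a plain memory access while the OS keeps the bytes persistent on disk. A minimal sketch (file name and contents are illustrative):

```python
# Hypothetical sketch of memory-mapped reads: map a file read-only and
# access its bytes by slicing, with no explicit read() call per lookup.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.db")
with open(path, "wb") as f:
    f.write(b"hello, mapped world")

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = bytes(m[:5])  # reading the mapping is a memory access
    m.close()
```

LMDB layers a copy-on-write B+-tree over such a mapping, which is why readers need no locks at all.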
Non-Monotonic Snapshot Isolation
Many distributed applications require transactions. However, transactional protocols that require strong synchronization are costly in large scale environments. Two properties help with scalability of a transactional system: genuine partial replication (GPR), which leverages the intrinsic parallelism of a workload, and snapshot isolation (SI), which decreases the need for synchronization. We show that, under standard assumptions (data store accesses are not known in advance, and transactions may access arbitrary objects in the data store), it is impossible to have both SI and GPR. To circumvent this impossibility, we propose a weaker consistency criterion, called Non-monotonic Snapshot Isolation (NMSI). NMSI retains the most important properties of SI, i.e., read-only transactions always commit, and two write-conflicting updates do not both commit. We present a GPR protocol that ensures NMSI, and has lower message cost (i.e., it contacts fewer replicas and/or commits faster) than previous approaches.
consistency  availability  database 
july 2013 by mpm
Inside HyperLevelDB
While stock LevelDB is an excellent foundation, our experience with HyperDex identified several opportunities for further performance improvements. This article describes the changes we've made to LevelDB to meet HyperDex clients' demanding needs.
database  io  concurrency  storage  non-blocking 
june 2013 by mpm
Aries: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging
In this paper we present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine-granularity (e.g., record) locking and recovery using write-ahead logging (WAL). We introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logging is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks.
storage  database 
may 2013 by mpm
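The per-page log sequence number check described in the abstract above can be sketched in a few lines: during redo, a logged update is reapplied only if the page's LSN shows the page has not yet seen it (record layout and field names are illustrative, not from the paper):

```python
# Hypothetical sketch of ARIES-style redo ("repeating history"):
# reapply every logged update that a page's LSN proves is missing,
# including updates that were originally made during rollbacks.

def redo(log, pages):
    for record in log:  # records arrive in LSN order
        page = pages[record["page_id"]]
        if page["lsn"] < record["lsn"]:       # update missing from page
            page["data"][record["key"]] = record["value"]
            page["lsn"] = record["lsn"]       # page now reflects record
        # else: page already contains this update; skip it

pages = {1: {"lsn": 0, "data": {}}}
log = [
    {"lsn": 10, "page_id": 1, "key": "a", "value": 1},
    {"lsn": 20, "page_id": 1, "key": "b", "value": 2},
]
redo(log, pages)
```

Only after history is repeated does ARIES roll back the loser transactions, logging those undo actions too so repeated crashes stay bounded.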
An efficient multi-tier tablet server storage architecture
Distributed, structured data stores such as Big Table, HBase, and Cassandra use a cluster of machines, each running a database-like software system called the Tablet Server Storage Layer or TSSL. A TSSL's performance on each node directly impacts the performance of the entire cluster. In this paper we introduce an efficient, scalable, multi-tier storage architecture for tablet servers. Our system can use any layered mix of storage devices such as Flash SSDs and magnetic disks. Our experiments show that by using a mix of technologies, performance for certain workloads can be improved beyond configurations using strictly two-tier approaches with one type of storage technology. We utilized, adapted, and integrated cache-oblivious algorithms and data structures, as well as Bloom filters, to improve scalability significantly. We also support versatile, efficient transactional semantics. We analyzed and evaluated our system against the storage layers of Cassandra and Hadoop HBase. We used a wide range of workloads and configurations from read- to write-optimized, as well as different input sizes. We found that our system is 3--10× faster than existing systems; that using proper data structures, algorithms, and techniques is critical for scalability, especially on modern Flash SSDs; and that one can fully support versatile transactions without sacrificing performance.
storage  database 
april 2013 by mpm
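The Bloom filters mentioned in the abstract above let a tablet server skip disk reads for keys that are certainly absent: membership tests can yield false positives but never false negatives. A minimal sketch (sizes and hashing scheme are illustrative):

```python
# Hypothetical minimal Bloom filter: k hash positions per key are set
# in a bit array; a key "might" be present only if all its bits are set.
import hashlib

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        # derive k independent positions from salted SHA-256 digests
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add("row:42")
```

A TSSL keeps one filter per on-disk component; a negative answer avoids the I/O entirely, while a (rare) false positive just costs one wasted read.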
The Bw-Tree: A B-tree for New Hardware
The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B-tree, called the Bw-tree, achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.
database  storage 
april 2013 by mpm
KairosDB is a fast distributed scalable time series database written primarily for Cassandra but works with HBase as well. It is a rewrite of the original OpenTSDB project started at StumbleUpon. Many thanks go out to the original authors for laying the groundwork and direction for this great product.
database  java 
march 2013 by mpm
MapDB provides concurrent TreeMap and HashMap backed by disk storage.
java  storage  database 
november 2012 by mpm
Transaction storage for geo-replicated systems
A key feature behind Walter is a new property called Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously, while providing strong guarantees within each site. PSI precludes write-write conflicts, so that developers need not worry about conflict-resolution logic
consistency  database 
october 2012 by mpm
The GetData Project is the reference implementation of the Dirfile Standards, a filesystem-based, column-oriented database format for time-ordered binary data. The Dirfile database format is designed to provide a fast, simple format for storing and reading data.
database  storage 
august 2012 by mpm
Akiban Persistit
Akiban Persistit™ is an open source transactional Java B+-tree library.
storage  database 
august 2012 by mpm
Calvin: Fast Distributed Transactions for Partitioned Database Systems
By replicating transaction inputs rather than effects, Calvin is also able to support multiple consistency levels—including Paxos based strong consistency across geographically distant replicas—at no cost to transactional throughput.
consistency  database  replication 
july 2012 by mpm
Granola: Low-Overhead Distributed Transaction Coordination
This paper presents Granola, a transaction coordination infrastructure for building reliable distributed storage applications. Granola provides a strong consistency model, while significantly reducing transaction coordination overhead
database  consistency 
july 2012 by mpm
DBToaster creates query engines for embedding into applications that require real-time, low-latency data processing and monitoring capabilities. DBToaster-generated engines are optimized for long-lived queries, where query results must be kept up-to-date with rapidly changing input data. Using database terminology, DBToaster engines maintain in-memory materialized views. Our performance claims refer to the speed at which DBToaster engines refresh views as the input data changes
database  data 
july 2012 by mpm
SSTable and Log Structured Storage: LevelDB
SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads
database  datastructure  io  storage 
july 2012 by mpm
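The SSTable abstraction described above is small enough to sketch: records are written in sorted key order with an index of file offsets, so point lookups seek directly and scans are sequential. A minimal sketch (the on-disk layout here is illustrative, not LevelDB's actual format):

```python
# Hypothetical minimal SSTable: sorted key-value records with a
# length-prefixed encoding and an in-memory {key: offset} index.
import io
import struct

def write_sstable(buf, items):
    """items must already be sorted by key; returns the offset index."""
    index = {}
    for key, value in items:
        index[key] = buf.tell()
        k, v = key.encode(), value.encode()
        # record = 4-byte key length, 4-byte value length, key, value
        buf.write(struct.pack(">II", len(k), len(v)) + k + v)
    return index

def get(buf, index, key):
    if key not in index:
        return None
    buf.seek(index[key])               # one seek per point lookup
    klen, vlen = struct.unpack(">II", buf.read(8))
    buf.read(klen)                     # skip the key bytes
    return buf.read(vlen).decode()

buf = io.BytesIO()
idx = write_sstable(buf, [("apple", "1"), ("pear", "2")])
```

Real SSTables are immutable once written, which is what makes the write path purely sequential; updates go to new tables that are later compacted.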
Readings in Databases
A list of papers essential to understanding databases and building new data systems.
june 2012 by mpm
Titan is a distributed graph database optimized for processing massive-scale graphs represented over a machine cluster. Titan separates the concerns of graph processing and manipulation from storing the graph on disk, delegating that concern to an extensible set of persistence solutions
graph  database 
june 2012 by mpm
Cache Craftiness for Fast Multicore Key-Value Storage
We present Masstree, a fast key-value database designed for SMP machines. Masstree keeps all data in memory. Its main data structure is a trie-like concatenation of B+-trees, each of which handles a fixed-length slice of a variable-length key. This structure effectively handles arbitrary-length possibly binary keys, including keys with long shared prefixes. B+-tree fanout was chosen to minimize total DRAM delay when descending the tree and prefetching each tree node. Lookups use optimistic concurrency control, a read-copy-update-like technique, and do not write shared data structures; updates lock only affected nodes. Logging and checkpointing provide consistency and durability.
performance  memory  database  concurrency 
may 2012 by mpm
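The "trie-like concatenation of B+-trees" in the Masstree description above hinges on slicing variable-length keys into fixed-length pieces: each trie layer is a B+-tree keyed by one fixed-width slice, descending to another layer for keys that continue past it. A sketch of just the slicing step (the 8-byte width matches the paper's design; the example key is made up):

```python
# Hypothetical sketch of Masstree-style key slicing: a variable-length
# binary key becomes a sequence of fixed 8-byte slices, one per layer.

def key_slices(key: bytes, width: int = 8):
    """Split a key into fixed-width slices; each slice indexes one
    B+-tree layer, so comparisons inside a layer are fixed-length."""
    return [key[i:i + width] for i in range(0, len(key), width)]

slices = key_slices(b"users/0000000042/profile")
```

Fixed-width slices mean every comparison inside a node is an integer-sized compare, and keys with long shared prefixes share upper layers instead of bloating every node.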
SQLCipher is an open source extension to SQLite that provides transparent 256-bit AES encryption of database files
database  confidentiality  storage 
april 2012 by mpm
WiredTiger is a high-performance, scalable, production-quality, NoSQL, open source extensible platform for data management. WiredTiger supports both traditional row-oriented storage, where all columns of a row are stored together, and column-oriented storage, where one or more columns can be stored individually, allowing more efficient access and storage.
april 2012 by mpm