jm + databases   77

[no title]
'For decades, the transaction concept has played a central role in
database research and development. Despite this prominence, transactional
databases today often surface much weaker models than the
classic serializable isolation guarantee—and, by default, far weaker
models than alternative,“strong but not serializable” models such as
Snapshot Isolation. Moreover, the transaction concept requires the
programmer’s involvement: should an application programmer fail
to correctly use transactions by appropriately encapsulating functionality,
even serializable transactions will expose programmers
to errors. While many errors arising from these practices may be
masked by low concurrency during normal operation, they are susceptible
to occur during periods of abnormally high concurrency. By
triggering these errors via concurrent access in a deliberate attack, a
determined adversary could systematically exploit them for gain.
In this work, we defined the problem of ACIDRain attacks and
introduced 2AD, a lightweight dynamic analysis tool that uses traces
of normal database activity to detect possible anomalous behavior
in applications. To enable 2AD, we extended Adya’s theory of weak
isolation to allow efficient reasoning over the space of all possible
concurrent executions of a set of transactions based on a concrete
history, via a new concept called an abstract history, which also
applies to API calls. We then applied 2AD analysis to twelve popular
self-hosted eCommerce applications, finding 22 vulnerabilities
spread across all but one application we tested, affecting over 50%
of eCommerce sites on the Internet today.

We believe that the magnitude and the prevalence of these vulnerabilities
to ACIDRain attacks merits a broader reconsideration of
the success of the transaction concept as employed by programmers
today, in addition to further pursuit of research in this direction.
Based on our early experiences both performing ACIDRain attacks
on self-hosted applications as well as engaging with developers, we
believe there is considerable work to be done in raising awareness
of these attacks—for example, via improved analyses and additional
2AD refinement rules (including analysis of source code to
better highlight sources of error)—and in automated methods for defending
against these attacks—for example, by synthesizing repairs
such as automated isolation level tuning and selective application
of SELECT FOR UPDATE mechanisms. Our results here—as well as
existing instances of ACIDRain attacks in the wild—suggest there
is considerable value at stake.'
databases  transactions  vulnerability  security  acidrain  peter-bailis  storage  isolation  acid 
5 days ago by jm
Instapaper Outage Cause & Recovery
Hard to see this as anything other than a pretty awful documentation fail by the AWS RDS service:
Without knowledge of the pre-April 2014 file size limit, it was difficult to foresee and prevent this issue. As far as we can tell, there’s no information in the RDS console in the form of monitoring, alerts or logging that would have let us know we were approaching the 2TB file size limit, or that we were subject to it in the first place. Even now, there’s nothing to indicate that our hosted database has a critical issue.
limits  aws  rds  databases  mysql  filesystems  ops  instapaper  risks 
6 weeks ago by jm
square/shift
'shift is a [web] application that helps you run schema migrations on MySQL databases'
databases  mysql  sql  migrations  ops  square  ddl  percona 
7 weeks ago by jm
Booking.com, MySQL and UTF-8
good preso from Percona Live 2015 on the messiness of MySQL vs UTF-8 and utf8mb4
utf-8  utf8mb4  mysql  storage  databases  slides  booking.com  character-sets 
december 2016 by jm
MemC3: Compact and concurrent Memcache with dumber caching and smarter hashing
An improved hashing algorithm called optimistic cuckoo hashing, and a CLOCK-based eviction algorithm that works in tandem with it. They are evaluated in the context of Memcached, where combined they give up to a 30% memory usage reduction and up to a 3x improvement in queries per second as compared to the default Memcached implementation on read-heavy workloads with small objects (as is typified by Facebook workloads).
memcached  performance  key-value-stores  storage  databases  cuckoo-hashing  algorithms  concurrency  caching  cache-eviction  memory  throughput 
november 2016 by jm
Individual children's details passed to Home Office for immigration purposes | UK news | The Guardian
The UK's version of the POD database project was used by the Home Office to track immigrants for various reasons -- in other words, exactly the reasons why parents will choose not to provide that data
parents  databases  data  pod  uk  home-office  education  schools 
october 2016 by jm
Charity Majors responds to the CleverTap Mongo outage war story
This is a great blog post, spot on:
You can’t just go “dudes it’s faster” and jump off a cliff.  This shit is basic.  Test real production workloads. Have a rollback plan.  (Not for *10 days* … try a month or two.)


The only thing I'd nitpick on is that it's all very well to say "buy my book" or "come see me talk at Blahcon", but a good blog post or webpage would be thousands of times more useful.
databases  stateful-services  services  ops  mongodb  charity-majors  rollback  state  storage  testing  dba 
october 2016 by jm
Cross-Region Read Replicas for Amazon Aurora
Creating a read replica in another region also creates an Aurora cluster in the region. This cluster can contain up to 15 more read replicas, with very low replication lag (typically less than 20 ms) within the region (between regions, latency will vary based on the distance between the source and target). You can use this model to duplicate your cluster and read replica setup across regions for disaster recovery. In the event of a regional disruption, you can promote the cross-region replica to be the master. This will allow you to minimize downtime for your cross-region application. This feature applies to unencrypted Aurora clusters.
aws  mysql  databases  storage  replication  cross-region  failover  reliability  aurora 
june 2016 by jm
_DataEngConf: Parquet at Datadog_
"How we use Parquet for tons of metrics data". good preso from Datadog on their S3/Parquet setup
datadog  parquet  storage  s3  databases  hadoop  map-reduce  big-data 
may 2016 by jm
Counting with domain specific databases — The Smyte Blog — Medium
whoa, pretty heavily engineered scalable counting system with Kafka, RocksDB and Kubernetes
kafka  rocksdb  kubernetes  counting  databases  storage  ops 
april 2016 by jm
These unlucky people have names that break computers
Pat McKenzie's name is too long to fit in Japanese database schemas; Janice Keihanaikukauakahihulihe'ekahaunaele's name was too long for US schemas; and Jennifer Null suffers from the obvious problem
databases  design  programming  names  coding  japan  schemas 
march 2016 by jm
Jepsen: RethinkDB 2.1.5
A good review of RethinkDB! Hopefully not just because this test is contract work on behalf of the RethinkDB team ;)
I’ve run hundreds of test against RethinkDB at majority/majority, at various timescales, request rates, concurrencies, and with different types of failures. Consistent with the documentation, I have never found a linearization failure with these settings. If you use hard durability, majority writes, and majority reads, single-document ops in RethinkDB appear safe.
rethinkdb  databases  stores  storage  ops  availability  cap  jepsen  tests  replication 
january 2016 by jm
Open-sourcing PalDB, a lightweight companion for storing side data
a new LinkedIn open source data store, for write-once/read-mainly side data, java, Apache licensed.

RocksDB discussion: https://www.facebook.com/groups/rocksdb.dev/permalink/834956096602906/
linkedin  open-source  storage  side-data  data  config  paldb  java  apache  databases 
october 2015 by jm
Your Relative's DNA Could Turn You Into A Suspect
Familial DNA searching has massive false positives, but is being used to tag suspects:
The bewildered Usry soon learned that he was a suspect in the 1996 murder of an Idaho Falls teenager named Angie Dodge. Though a man had been convicted of that crime after giving an iffy confession, his DNA didn’t match what was found at the crime scene. Detectives had focused on Usry after running a familial DNA search, a technique that allows investigators to identify suspects who don’t have DNA in a law enforcement database but whose close relatives have had their genetic profiles cataloged. In Usry’s case the crime scene DNA bore numerous similarities to that of Usry’s father, who years earlier had donated a DNA sample to a genealogy project through his Mormon church in Mississippi. That project’s database was later purchased by Ancestry, which made it publicly searchable—a decision that didn’t take into account the possibility that cops might someday use it to hunt for genetic leads.

Usry, whose story was first reported in The New Orleans Advocate, was finally cleared after a nerve-racking 33-day wait — the DNA extracted from his cheek cells didn’t match that of Dodge’s killer, whom detectives still seek. But the fact that he fell under suspicion in the first place is the latest sign that it’s time to set ground rules for familial DNA searching, before misuse of the imperfect technology starts ruining lives.
dna  familial-dna  false-positives  law  crime  idaho  murder  mormon  genealogy  ancestry.com  databases  biometrics  privacy  genes 
october 2015 by jm
Cluster benchmark: Scylla vs Cassandra
ScyllaDB (the C* clone in C++) is now actually looking promising -- still need more reassurance about its consistency/reliabilty side though
scylla  databases  storage  cassandra  nosql 
october 2015 by jm
After Bara: All your (Data)base are belong to us
Sounds like the CJEU's Bara decision may cause problems for the Irish government's wilful data-sharing:
Articles 10, 11 and 13 of Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995, on the protection of individuals with regard to the processing of personal data and on the free movement of such data, must be interpreted as precluding national measures, such as those at issue in the main proceedings, which allow a public administrative body of a Member State to transfer personal data to another public administrative body and their subsequent processing, without the data subjects having been informed of that transfer or processing.
data  databases  bara  cjeu  eu  law  privacy  data-protection 
october 2015 by jm
Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support
There was a breakdown in communication between the developer who requested the index migration and the database operator who deleted the old index. Instead of working on the migration together, they communicated in an implicit way through flawed tooling. The dashboard that surfaced the migration request was missing important context: the reason for the requested deletion, the dependency on another index’s creation, and the criticality of the index for API traffic. Indeed, the database operator didn’t have a way to check whether the index had recently been used for a query.


Good demo of how the Etsy-style chatops deployment approach would have helped avoid this risk.
stripe  postmortem  outages  databases  indexes  deployment  chatops  deploy  ops 
october 2015 by jm
SQL on Kafka using PipelineDB
this is quite nice. PipelineDB allows direct hookup of a Kafka stream, and will ingest durably and reliably, and provide SQL views computed over a sliding window of the stream.
logging  sql  kafka  pipelinedb  streaming  sliding-window  databases  search  querying 
september 2015 by jm
Scaling Analytics at Amplitude
Good blog post on Amplitude's lambda architecture setup, based on S3 and a custom "real-time set database" they wrote themselves.

antirez' comment from a Redis angle on the set database: http://antirez.com/news/92

HN thread: https://news.ycombinator.com/item?id=10118413
lambda-architecture  analytics  via:hn  redis  set-storage  storage  databases  architecture  s3  realtime 
august 2015 by jm
Mikhail Panchenko's thoughts on the July 2015 CircleCI outage
an excellent followup operational post on CircleCI's "database is not a queue" outage
database-is-not-a-queue  mysql  sql  databases  ops  outages  postmortems 
july 2015 by jm
Elements of Scale: Composing and Scaling Data Platforms
Great, encyclopedic blog post rounding up common architectural and algorithmic patterns using in scalable data platforms. Cut out and keep!
architecture  storage  databases  data  big-data  scaling  scalability  ben-stopford  cqrs  druid  parquet  columnar-stores  lambda-architecture 
may 2015 by jm
Please stop calling databases CP or AP
In his excellent blog post [...] Jeff Hodges recommends that you use the CAP theorem to critique systems. A lot of people have taken that advice to heart, describing their systems as “CP” (consistent but not available under network partitions), “AP” (available but not consistent under network partitions), or sometimes “CA” (meaning “I still haven’t read Coda’s post from almost 5 years ago”).

I agree with all of Jeff’s other points, but with regard to the CAP theorem, I must disagree. The CAP theorem is too simplistic and too widely misunderstood to be of much use for characterizing systems. Therefore I ask that we retire all references to the CAP theorem, stop talking about the CAP theorem, and put the poor thing to rest. Instead, we should use more precise terminology to reason about our trade-offs.
cap  databases  storage  distcomp  ca  ap  cp  zookeeper  consistency  reliability  networking 
may 2015 by jm
Call me maybe: Aerospike
'Aerospike offers phenomenal latencies and throughput -- but in terms of data safety, its strongest guarantees are similar to Cassandra or Riak in Last-Write-Wins mode. It may be a safe store for immutable data, but updates to a record can be silently discarded in the event of network disruption. Because Aerospike’s timeouts are so aggressive–on the order of milliseconds -- even small network hiccups are sufficient to trigger data loss. If you are an Aerospike user, you should not expect “immediate”, “read-committed”, or “ACID consistency”; their marketing material quietly assumes you have a magical network, and I assure you this is not the case. It’s certainly not true in cloud environments, and even well-managed physical datacenters can experience horrible network failures.'
aerospike  outages  cap  testing  jepsen  aphyr  databases  storage  reliability 
may 2015 by jm
Making Pinterest — Learn to stop using shiny new things and love MySQL
'The third reason people go for shiny is because older tech isn’t advertised as aggressively as newer tech. The younger companies needs to differentiate from the old guard and be bolder, more passionate and promise to fulfill your wildest dreams. But most new tech sales pitches aren’t generally forthright about their many failure modes. In our early days, we fell into this third trap. We had a lot of growing pains as we scaled the architecture. The most vocal and excited database companies kept coming to us saying they’d solve all of our scalability problems. But nobody told us of the virtues of MySQL, probably because MySQL just works, and people know about it.'

It's true! -- I'm still a happy MySQL user for some use cases, particularly read-mostly relational configuration data...
mysql  storage  databases  reliability  pinterest  architecture 
april 2015 by jm
devbook/README.md at master · barsoom/devbook
How to avoid the shitty behaviour of ActiveRecord wrt migration safety, particularly around removing/renaming columns. ugh, ActiveRecord
activerecord  fail  rails  mysql  sql  migrations  databases  schemas  releasing 
march 2015 by jm
Goodbye MongoDB, Hello PostgreSQL
Another core problem we’ve faced is one of the fundamental features of MongoDB (or any other schemaless storage engine): the lack of a schema. The lack of a schema may sound interesting, and in some cases it can certainly have its benefits. However, for many the usage of a schemaless storage engine leads to the problem of implicit schemas. These schemas aren’t defined by your storage engine but instead are defined based on application behaviour and expectations.


Well, don't say we didn't warn you ;)
mongodb  mysql  postgresql  databases  storage  schemas  war-stories 
march 2015 by jm
0x74696d | Falling In And Out Of Love with DynamoDB, Part II
Good DynamoDB real-world experience post, via Mitch Garnaat. We should write up ours, although it's pretty scary-stuff-free by comparison
aws  dynamodb  storage  databases  architecture  ops 
february 2015 by jm
Registering children: Ireland’s Primary Online Database
If you haven’t heard about it, it is a compulsory database of the personal information of children, including PPS numbers, ethnicity, race and language skills, to be held for decades and shared across State agencies.
privacy  ppsn  databases  pod  ireland  children  kids  primary-schools 
january 2015 by jm
Good advice on running large-scale database stress tests
I've been bitten by poor key distribution in tests in the past, so this is spot on: 'I'd run it with Zipfian, Pareto, and Dirac delta distributions, and I'd choose read-modify-write transactions.'

And of course, a dataset bigger than all combined RAM.

Also: http://smalldatum.blogspot.ie/2014/04/biebermarks.html -- the "Biebermark", where just a single row out of the entire db is contended on in a read/modify/write transaction: "the inspiration for this is maintaining counts for [highly contended] popular entities like Justin Bieber and One Direction."
biebermark  benchmarks  testing  performance  stress-tests  databases  storage  mongodb  innodb  foundationdb  aphyr  measurement  distributions  keys  zipfian 
december 2014 by jm
If Eventual Consistency Seems Hard, Wait Till You Try MVCC
ex-Percona MySQL wizard Baron Schwartz, noting that MVCC as implemented in common SQL databases is not all that simple or reliable compared to big bad NoSQL Eventual Consistency:
Since I am not ready to assert that there’s a distributed system I know to be better and simpler than eventually consistent datastores, and since I certainly know that InnoDB’s MVCC implementation is full of complexities, for right now I am probably in the same position most of my readers are: the two viable choices seem to be single-node MVCC and multi-node eventual consistency. And I don’t think MVCC is the simpler paradigm of the two.
nosql  concurrency  databases  mysql  riak  voldemort  eventual-consistency  reliability  storage  baron-schwartz  mvcc  innodb  postgresql 
december 2014 by jm
Hermitage: Testing the "I" in ACID
[Hermitage is] a test suite for databases which probes for a variety of concurrency issues, and thus allows a fair and accurate comparison of isolation levels. Each test case simulates a particular kind of race condition that can happen when two or more transactions concurrently access the same data. Each test can pass (if the database’s implementation of isolation prevents the race condition from occurring) or fail (if the race condition does occur).
acid  architecture  concurrency  databases  nosql 
november 2014 by jm
"Macaroons" for fine-grained secure database access
Macaroons are an excellent fit for NoSQL data storage for several reasons. First, they enable an application developer to enforce security policies at very fine granularity, per object. Gone are the clunky security policies based on the IP address of the client, or the per-table access controls of RDBMSs that force you to split up your data across many tables. Second, macaroons ensure that a client compromise does not lead to loss of the entire database. Third, macaroons are very flexible and expressive, able to incorporate information from external systems and third-party databases into authorization decisions. Finally, macaroons scale well and are incredibly efficient, because they avoid public-key cryptography and instead rely solely on fast hash functions.
security  macaroons  cookies  databases  nosql  case-studies  storage  authorization  hyperdex 
november 2014 by jm
Mnesia and CAP
A common “trick” is to claim:

'We assume network partitions can’t happen. Therefore, our system is CA according to the CAP theorem.'

This is a nice little twist. By asserting network partitions cannot happen, you just made your system into one which is not distributed. Hence the CAP theorem doesn’t even apply to your case and anything can happen. Your system may be linearizable. Your system might have good availability. But the CAP theorem doesn’t apply. [...]
In fact, any well-behaved system will be “CA” as long as there are no partitions. This makes the statement of a system being “CA” very weak, because it doesn’t put honesty first. I tries to avoid the hard question, which is how the system operates under failure. By assuming no network partitions, you assume perfect information knowledge in a distributed system. This isn’t the physical reality.
cap  erlang  mnesia  databases  storage  distcomp  reliability  ca  postgres  partitions 
october 2014 by jm
Understanding weak isolation is a serious problem
Peter Bailis complaining about the horrors of modern transactional databases and their unserializability, which noone seems to be paying attention to:

'As you’re probably aware, there’s an ongoing and often lively debate between transactional adherents and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional inherents and the research community as a whole have effectively ignored weak isolation — even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.'

'Despite the ubiquity of weak isolation, I haven’t found a database architect, researcher, or user who’s been able to offer an explanation of when, and, probably more importantly, why isolation models such as Read Committed are sufficient for correct execution. It’s reasonably well known that these weak isolation models represent “ACID in practice,” but I don’t think we have any real understanding of how so many applications are seemingly (!?) okay running under them. (If you haven’t seen these models before, they’re a little weird. For example, Read Committed isolation generally prevents users from reading uncommitted or non-final writes but allows a number of bad things to happen, like lost updates during concurrent read-modify-write operations. Why is this apparently okay for many applications?)'
acid  consistency  databases  peter-bailis  transactional  corruption  serializability  isolation  reliability 
september 2014 by jm
Aerospike's CA boast gets a thumbs-down from @aphyr
Specifically, @aerospikedb cannot offer cursor stability, repeatable read, snapshot isolation, or any flavor of serializability.
@nasav @aerospikedb At *best* you can offer Read Committed, which is not, I assert, what most people would expect from an "ACID" database.
aphyr  aerospike  availability  consistency  acid  transactions  distcomp  databases  storage 
september 2014 by jm
The Myth of Schema-less [NoSQL]
We don't seem to gain much in terms of database flexibility. Is our application more flexible? I don't think so. Even without our schema explicitly defined in our database, it's there... somewhere. You simply have to search through hundreds of thousands of lines to find all the little bits of it. It has the potential to be in several places, making it harder to properly identify. The reality of these codebases is that they are error prone and rarely lack the necessary documentation. This problem is magnified when there are multiple codebases talking to the same database. This is not an uncommon practice for reporting or analytical purposes.

Finally, all this "flexibility" rears its head in the same way that PHP and Javascript's "neat" weak typing stabs you right in the face. There are some somethings you can be cavalier about, and some things you should be strict about. Your data model is one you absolutely need to be strict on. If a field should store an int, it should store nothing else. Not a string, not a picture of a horse, but an integer. It's nice to know that I have my database doing type checking for me and I can expect a field to be the same type across all records.

All this leads us to an undeniable fact: There is always a schema. Wearing "I don't do schema" as a badge of honor is a complete joke and encourages a terrible development practice.
nosql  databases  storage  schema  strong-typing 
july 2014 by jm
'Robust De-anonymization of Large Sparse Datasets' [pdf]
paper by Arvind Narayanan and Vitaly Shmatikov, 2008.

'We present a new class of statistical de- anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary's background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world's largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber's record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.'
anonymisation  anonymization  sanitisation  databases  data-dumps  privacy  security  papers 
june 2014 by jm
How to make breaking changes and not break all the things
Well-written description of the "several backward-compatible changes" approach to breaking-change schema migration (via Marc)
databases  coding  compatibility  migration  schemas  sql  continuous-deployment 
june 2014 by jm
Call me maybe: Elasticsearch
Wow, these are terrible results. From the sounds of it, ES just cannot deal with realistic outage scenarios and is liable to suffer catastrophic damage in reasonably-common partitions.
If you are an Elasticsearch user (as I am): good luck. Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present. If you can, store your data in a safer database, and feed it into Elasticsearch gradually. Have processes in place that continually traverse the system of record, so you can recover from ES data loss automatically.
elasticsearch  ops  storage  databases  jepsen  partition  network  outages  reliability 
june 2014 by jm
CockroachDB
a distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. Cockroach nodes are symmetric; a design goal is one binary with minimal configuration and no required auxiliary services.

Cockroach implements a single, monolithic sorted map from key to value where both keys and values are byte strings (not unicode). Cockroach scales linearly (theoretically up to 4 exabytes (4E) of logical data). The map is composed of one or more ranges and each range is backed by data stored in RocksDB (a variant of LevelDB), and is replicated to a total of three or more cockroach servers. Ranges are defined by start and end keys. Ranges are merged and split to maintain total byte size within a globally configurable min/max size interval. Range sizes default to target 64M in order to facilitate quick splits and merges and to distribute load at hotspots within a key range. Range replicas are intended to be located in disparate datacenters for survivability (e.g. { US-East, US-West, Japan }, { Ireland, US-East, US-West}, { Ireland, US-East, US-West, Japan, Australia }).

Single mutations to ranges are mediated via an instance of a distributed consensus algorithm to ensure consistency. We’ve chosen to use the Raft consensus algorithm. All consensus state is stored in RocksDB.

A single logical mutation may affect multiple key/value pairs. Logical mutations have ACID transactional semantics. If all keys affected by a logical mutation fall within the same range, atomicity and consistency are guaranteed by Raft; this is the fast commit path. Otherwise, a non-locking distributed commit protocol is employed between affected ranges.

Cockroach provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes--both from an historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation; clients must consciously decide to trade correctness for performance. Cockroach implements a limited form of linearalizability, providing ordering for any observer or chain of observers.


This looks nifty. One to watch.
cockroachdb  databases  storage  georeplication  raft  consensus  acid  go  key-value-stores  rocksdb 
may 2014 by jm
Scalable Atomic Visibility with RAMP Transactions
Great new distcomp protocol work from Peter Bailis et al:
We’ve developed three new algorithms—called Read Atomic Multi-Partition (RAMP) Transactions—for ensuring atomic visibility in partitioned (sharded) databases: either all of a transaction’s updates are observed, or none are. [...]

How they work: RAMP transactions allow readers and writers to proceed concurrently. Operations race, but readers autonomously detect the races and repair any non-atomic reads. The write protocol ensures readers never stall waiting for writes to arrive.

Why they scale: Clients can’t cause other clients to stall (via synchronization independence) and clients only have to contact the servers responsible for items in their transactions (via partition independence). As a consequence, there’s no mutual exclusion or synchronous coordination across servers.

The end result: RAMP transactions outperform existing approaches across a variety of workloads, and, for a workload of 95% reads, RAMP transactions scale to over 7 million ops/second on 100 servers at less than 5% overhead.
scale  synchronization  databases  distcomp  distributed  ramp  transactions  scalability  peter-bailis  protocols  sharding  concurrency  atomic  partitions 
april 2014 by jm
Huge Redis rant
I want to emphasize that if you use redis as intended (as a slightly-persistent, not-HA cache), it's great. Unfortunately, more and more shops seem to be thinking that Redis is a full-service database and, as someone who's had to spend an inordinate amount of time maintaining such a setup, it's not. If you're writing software and you're thinking "hey, it would be easy to just put a SET key value in this code and be done," please reconsider. There are lots of great products out there that are better for the overwhelming majority of use cases.


Ouch. (via Aphyr)
redis  storage  architecture  memory  caching  ha  databases 
february 2014 by jm
RocksDB
' A persistent key-value store for fast storage environments', ie. BerkeleyDB/LevelDB competitor, from Facebook.
RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory and write-once workloads, and to be flexible to allow for innovation.

We benchmarked LevelDB and found that it was unsuitable for our server workloads. Thebenchmark results look awesome at first sight, but we quickly realized that those results were for a database whose size was smaller than the size of RAM on the test machine - where the entire database could fit in the OS page cache. When we performed the same benchmarks on a database that was at least 5 times larger than main memory, the performance results were dismal.

By contrast, we've published the RocksDB benchmark results for server side workloads on Flash. We also measured the performance of LevelDB on these server-workload benchmarks and found that RocksDB solidly outperforms LevelDB for these IO bound workloads. We found that LevelDB's single-threaded compaction process was insufficient to drive server workloads. We saw frequent write-stalls with LevelDB that caused 99-percentile latency to be tremendously large. We found that mmap-ing a file into the OS cache introduced performance bottlenecks for reads. We could not make LevelDB consume all the IOs offered by the underlying Flash storage.


Lots of good discussion at https://news.ycombinator.com/item?id=6736900 too.
flash  ssd  rocksdb  databases  storage  nosql  facebook  bdb  disk  key-value-stores  lsm  leveldb 
november 2013 by jm
The trouble with timestamps
Timestamps, as implemented in Riak, Cassandra, et al, are fundamentally unsafe ordering constructs. In order to guarantee consistency you, the user, must ensure locally monotonic and, to some extent, globally monotonic clocks. This is a hard problem, and NTP does not solve it for you. When wall clocks are not properly coupled to the operations in the system, causal constraints can be violated. To ensure safety properties hold all the time, rather than probabilistically, you need logical clocks.
clocks  time  distributed  databases  distcomp  ntp  via:fanf  aphyr  vector-clocks  last-write-wins  lww  cassandra  riak 
october 2013 by jm
LinkBench: A database benchmark for the social graph
However, the gold standard for database benchmarking is to test the performance of a system on the real production workload, since synthetic benchmarks often don't exercise systems in the same way. When making decisions about a significant component of Facebook's infrastructure, we need to understand how a database system will really perform in Facebook's production workload. [....] LinkBench addresses these needs by replicating the data model, graph structure, and request mix of our MySQL social graph workload.


Mentioned in a presentation from Peter Bailis, http://www.hpts.ws/papers/2013/bailis-hpts-2013.pdf
graph  databases  mysql  facebook  performance  testing  benchmarks  workloads 
october 2013 by jm
Google swaps out MySQL, moves to MariaDB
When we asked Sallner to quantify the scale of the migration he said, "They're moving it all. Everything they have. All of the MySQL servers are moving to MariaDB, as far as I understand."

By moving to MariaDB, Google can free itself of any dependence on technology dictated by Oracle – a company whose motivations are unclear, and whose track record for working with the wider technology community is dicey, to say the least. Oracle has controlled MySQL since its acquisition of Sun in 2010, and the key InnoDB storage engine since it got ahold of Innobase in 2005.

[...] We asked Cole why Google would shift from MySQL to MariaDB, and what the key technical differences between the systems were. "From my perspective, they're more or less equivalent other than if you look at specific features and how they implement them," Cole said, speaking in a personal capacity and not on behalf of Google. "Ideologically there are lots of differences."


So -- AWS, when will RDS offer MariaDB as an option?
google  mysql  mariadb  sql  open-source  licensing  databases  storage  innodb  oracle 
september 2013 by jm
Non-blocking transactional atomicity
Peter Bailis with an interesting distributed-storage atomicity algorithm for performing multi-record transactional updates
algorithms  nbta  transactions  databases  storage  distcomp  distributed  atomic  coding  eventual-consistency  crdts 
september 2013 by jm
LMDB response to a LevelDB-comparison blog post
This seems like a good point to note about LMDB in general:

We state quite clearly that LMDB is read-optimized, not write-optimized. I wrote this for the OpenLDAP Project; LDAP workloads are traditionally 80-90% reads. Write performance was not the goal of this design, read performance is. We make no claims that LMDB is a silver bullet, good for every situation. It’s not meant to be – but it is still far better at many things than all of the other DBs out there that *do* claim to be good for everything.
lmdb  leveldb  databases  openldap  storage  persistent 
august 2013 by jm
The Irish State wishes to uninvent computers with new FOI Bill
Mark Coughlan noticed this:
The FOI body shall take reasonable steps to search for and extract the records to which the request relates, having due regard to the steps that would be considered reasonable if the records were held in paper format.


In other words, pretend that computerised database technology, extant since the 1960s, does not exist. Genius (via Simon McGarr)
funny  irish  ireland  foi  open-data  freedom  computerisation  punch-cards  paper  databases 
august 2013 by jm
Lightning Memory-Mapped Database
Sounds like a good potential replacement for Berkeley DB, at least for cases where LevelDB isn't proving practical.
LMDB is a database storage engine similar to LevelDB or BDB which database authors often use as a base for building databases on top of. LMDB was designed as a replacement for BDB within the OpenLDAP project but it has been pretty useful to use with other databases as well. It’s API design is highly influenced by BDB so that replacing BDB is straightforward.


Licensed under the OpenLDAP Public License (is that BSDish?)
openldap  lmdb  databases  bdb  berkeley-db  storage  persistence  oss  open-source 
july 2013 by jm
Instagram: Making the Switch to Cassandra from Redis, a 75% 'Insta' Savings
shifting data out of RAM and onto SSDs -- unsurprisingly, big savings.
a 12 node cluster of EC2 hi1.4xlarge instances; we store around 1.2TB of data across this cluster. At peak, we're doing around 20,000 writes per second to that specific cluster and around 15,000 reads per second. We've been really impressed with how well Cassandra has been able to drop into that role.
ram  ssd  cassandra  databases  nosql  redis  instagram  storage  ec2 
june 2013 by jm
Call me maybe: Carly Rae Jepsen and the perils of network partitions
Kyle "aphyr" Kingsbury expands on his slides demonstrating the real-world failure scenarios that arise during some kinds of partitions (specifically, the TCP-hang, no clear routing failure, network partition scenario). Great set of blog posts clarifying CAP
distributed  network  databases  cap  nosql  redis  mongodb  postgresql  riak  crdt  aphyr 
may 2013 by jm
Berkeley DB Java Edition Architecture [PDF]
background white paper on the BDB-JE innards and design, from 2006. Still pretty accurate and good info
bdb-je  java  berkeley-db  bdb  design  databases  pdf  white-papers  trees 
may 2013 by jm
Alex Feinberg's response to Damien Katz' anti-Dynamoish/pro-Couchbase blog post
Insightful response, worth bookmarking. (the original post is at http://damienkatz.net/2013/05/dynamo_sure_works_hard.html ).
while you are saving on read traffic (online reads only go to the master), you are now decreasing availability (contrary to your stated goal), and increasing system complexity.
You also do hurt performance by requiring all writes and reads to be serialized through a single node: unless you plan to have a leader election whenever the node fails to meet a read SLA (which is going to result a disaster -- I am speaking from personal experience), you will have to accept that you're bottlenecked by a single node. With a Dynamo-style quorum (for either reads or writes), a single straggler will not reduce whole-cluster latency.
The core point of Dynamo is low latency, availability and handling of all kinds of partitions: whether clean partitions (long term single node failures), transient failures (garbage collection pauses, slow disks, network blips, etc...), or even more complex dependent failures.
The reality, of course, is that availability is neither the sole, nor the principal concern of every system. It's perfect fine to trade off availability for other goals -- you just need to be aware of that trade off.
cap  distributed-databases  databases  quorum  availability  scalability  damien-katz  alex-feinberg  partitions  network  dynamo  riak  voldemort  couchbase 
may 2013 by jm
Is Your MySQL Buffer Pool Warm? Make It Sweat!
How GroupOn are warming up a failover warm MySQL spare, using Percona stuff and a "tee" of the live in-flight queries. (via Dave Doran)
via:dave-doran  mysql  databases  warm-spares  spares  failover  groupon  percona  replication 
april 2013 by jm
CouchDB: not drinking the kool-aid
Jonathan Ellis on some CouchDB negatives:
Here are some reasons you should think twice and do careful testing before using CouchDB in a non-toy project:
Writes are serialized.  Not serialized as in the isolation level, serialized as in there can only be one write active at a time.  Want to spread writes across multiple disks?  Sorry.
CouchDB uses a MVCC model, which means that updates and deletes need to be compacted for the space to be made available to new writes.  Just like PostgreSQL, only without the man-years of effort to make vacuum hurt less.
CouchDB is simple.  Gloriously simple.  Why is that a negative?  It's competing with systems (in the popular imagination, if not in its author's mind) that have been maturing for years.  The reason PostgreSQL et al have those features is because people want them.  And if you don't, you should at least ask a DBA with a few years of non-MySQL experience what you'll be missing.  The majority of CouchDB fans don't appear to really understand what a good relational database gives them, just as a lot of PHP programmers don't get what the big deal is with namespaces.
A special case of simplicity deserves mention: nontrivial queries must be created as a view with mapreduce.  MapReduce is a great approach to trivially parallelizing certain classes of problem.  The problem is, it's tedious and error-prone to write raw MapReduce code.  This is why Google and Yahoo have both created high-level languages on top of it (Sawzall and Pig, respectively).  Poor SQL; even with DSLs being the new hotness, people forget that SQL is one of the original domain-specific languages.  It's a little verbose, and you might be bored with it, but it's much better than writing low-level mapreduce code.
cassandra  couch  nosql  storage  distributed  databases  consistency 
april 2013 by jm
Why I'm Walking Away From CouchDB
In practice there are two gotchas that are so painful I am  looking for a replacement with a different featureset than couchdb provides. The location tracking project icecondor.com uses couchdb to store 20,000 new records per day. It has more write traffic than read traffic and runs on modest hardware. Those two gotchas are:

1. View Index updates.

While I have a vague understanding of why view index updates are slow and bulky and important, in practice it is unworkable. Every write sets up a trap for the first reader to come along after the write. The more writes there are, the bigger the trap for the first reader which has to wait on the couchdb process that refreshes the view index on an as-needed basis. I believe this trade-off was made to keep writes fast. No need to update the view index until all writes are actually complete, right? Write traffic is heavier than read traffic and the time needed for that index refresh causes the webapp to crash because its not setup to handle timeouts from a database query. The workaround is as hackish as one can imagine -  cron jobs to hit every  map/reduce query to keep indexes fresh.

2. Append only database file

Append only is in theory a great way to ensure on-disk reliability. A system crash during an append should only affect that append. Its a crash during an update to existing parts of the file that risks the integrity of more than whats being updated. With so many layers of caching and optimizations in the kernel and the filesystem and now in the workings of SSD drives, I'm not sure append-only gives extra protection anymore.

What it does do is a create a huge operational headache. The on-disk file can never grow beyond half the available storage space. Record deletion uses new disk space and if the half-full mark approaches, vacuuming must be done. The entire database is rewritten to the filesystem, leaving out no longer needed records. If the data file should happen to grow beyond half the partition, the system has esentially crashed because there is no way to compact the file and soon the partition will be full. This is a likely scenario when there is a lot of record deletion activity.

The system in question does a lot of writes of temporary data that is followed up by deletes a few days later. There is also a lot of permanent storage that hardly gets used. Rewriting every byte of the records that are long-lived due to compaction is an enormous amount of wasted I/O - doubly so given SSD drives have a short write-cycle lifespan.
nosql  couchdb  consistency  checkpointing  databases  data-stores  indexing 
april 2013 by jm
Data Corruption To Go: The Perils Of sql_mode = NULL « Code as Craft
bloody hell. A load of cases where MySQL will happily accommodate all sorts of malformed and invalid input -- thankfully with fixes.

Also includes a very nifty example of Etsy tee'ing their production db traffic (30k pps in and out) via tcpdump and pt-query-digest to a test database host. Fantastic hackery
mysql  input  corrupt  invalid  validation  coding  databases  sql  testing  tcpdump  percona  pt-query-digest  tee 
march 2013 by jm
Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack
reasonably good whole-stack performance testing and analysis; HBase, Riak, MongoDB, and Cassandra compared. Riak did pretty badly :(
riak  mongodb  cassandra  hbase  performance  analytics  hadoop  hive  big-data  storage  databases  nosql 
february 2013 by jm
Basho | Alert Logic Relies on Riak to Support Rapid Growth
'The new [Riak-based] analytics infrastructure performs statistical and correlation processing on all data [...] approximately 5 TB/day. All of this data is processed in real-time as it streams in. [...] Alert Logic’s analytics infrastructure, powered by Riak, achieves performance results of up to 35k operations/second across each node in the cluster – performance that eclipses the existing MySQL deployment by a large margin on single node performance. In real business terms, the initial deployment of the combination of Riak and the analytic infrastructure has allowed Alert Logic to process in real-time 7,500 reports, which previously took 12 hours of dedicated processing every night.'

Twitter discussion here: https://twitter.com/fisherpk/status/294984960849367040 , which notes 'heavily cached SAN storage, 12 core blades and 90% get to put ops', and '3 riak nodes, 12-cores, 30k get heavy riak ops/sec. 8 nodes driving ops to that cluster'. Apparently the use of SAN storage on all nodes is historic, but certainly seems to have produced good iops numbers as an (expensive) side-effect...
iops  riak  basho  ops  systems  alert-logic  storage  nosql  databases 
january 2013 by jm
Goodbye, CouchDB
'From most model-using code, using [Percona] MySQL looks exactly the same as using CouchDB did. Except it’s faster, and the DB basically never fails.'
couchdb  mysql  nosql  databases  storage  percona  via:peakscale 
may 2012 by jm
LevelDB Benchmarks
nice results, particularly for sequential ops. will be a Riak backend vs InnoDB
leveldb  riak  databases  files  disk  google  storage  benchmarks 
july 2011 by jm
How we use Redis at Bump
via Simon Willison. some nice ideas here, particularly using a replication slave to handle the potentially latency-impacting disk writes in AOF mode
queueing  redis  nosql  databases  storage  via:simonw  replication  bump 
july 2011 by jm
GitHub outage post-mortem
continuous-integration system was accidentally run against the production db. result: the entire production database got wiped. ouuuuch
ouch  github  outages  post-mortem  databases  testing  c-i  production  firewalls  from delicious
november 2010 by jm
Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs [PDF]
sort-and-merge is likely to be faster on future SIMD-capable multicore CPUs RSN
sort  merge  hash  join  databases  performance  cpu  simd  multicore  from delicious
june 2010 by jm
GitHub scheduled maintainance due to Redis upgrade
good comments on the processes useful for large-scale Redis upgrades
upgrades  redis  spof  nosql  databases  github  deployment  from delicious
may 2010 by jm
NoSQL at Twitter (NoSQL EU 2010) [PDF]
specifically, Hadoop and Pig for log/metrics analytics, Cassandra going forward; great preso, lots of detail and code examples. also, impressive number-crunching going on at Twitter
twitter  analytics  cassandra  databases  hadoop  pdf  logs  metrics  number-crunching  nosql  pig  presentation  slides  scribe  from delicious
april 2010 by jm
Humblog - Philip Kirwan Ripped Off My iPhone App Content
ouch, nasty allegations. Strikes me that there's a chicken/egg problem: scraping the Dublin Bus website to build a database which you then sell as part of a commercial iPhone app is probably pretty shaky ground to start with
ip  databases  collation  collections  dublin-bus  iphone  apps  scraping  from delicious
january 2010 by jm
Why I like Redis
Simon Willison plugs Redis as a good datastore for quick-hack scripts with requirements for lots of fast, local data storage -- the kind of thing I'd often use a DB_File for
python  storage  databases  schemaless  nosql  redis  simon-willison  data-store  from delicious
october 2009 by jm

related tags

acid  acidrain  activerecord  aerospike  alert-logic  alex-feinberg  algorithms  analytics  ancestry.com  anonymisation  anonymization  ap  apache  aphyr  apps  architecture  atomic  aurora  authorization  availability  aws  banking  banks  bara  baron-schwartz  basho  bdb  bdb-je  ben-stopford  benchmarks  berkeley-db  biebermark  big-data  biometrics  booking.com  bump  c-i  ca  cache-eviction  caching  cap  cap-theorem  case-studies  cassandra  character-sets  charity-majors  chatops  checkpointing  children  cjeu  clocks  cockroachdb  coding  collation  collections  column-oriented  columnar-stores  compatibility  computerisation  concurrency  config  consensus  consistency  continuous-deployment  cookies  corrupt  corruption  couch  couchbase  couchdb  counting  cp  cpu  cqrs  crdt  crdts  crime  cross-region  cuckoo-hashing  damien-katz  data  data-dumps  data-protection  data-store  data-stores  database-is-not-a-queue  databases  datadog  dba  ddl  deploy  deployment  design  disk  distcomp  distributed  distributed-databases  distributed-systems  distributions  dna  druid  dublin-bus  durability  dynamo  dynamodb  ec2  education  elasticache  elasticsearch  erlang  eu  eventual-consistency  facebook  fail  failover  false-positives  familial-dna  files  filesystems  firewalls  flash  foi  foundationdb  freedom  funny  genealogy  genes  georeplication  git  github  go  google  graph  groupon  ha  hadoop  hash  hbase  hive  home-office  hyperdex  idaho  indexes  indexing  innodb  input  instagram  instapaper  insurance  invalid  iops  ip  iphone  ireland  irish  isolation  japan  java  jay-kreps  jepsen  join  kafka  key-value-stores  keys  kids  kubernetes  lambda-architecture  last-write-wins  law  leveldb  licensing  limits  linkedin  lmdb  log  logging  logs  lsm  lww  macaroons  map-reduce  marc-brooker  mariadb  measurement  memcached  memory  merge  merging  metrics  migration  migrations  mnesia  mongodb  mormon  multicore  murder  mvcc  mysql  names  nbta  network  networking  nosql  ntp  number-crunching  open-data  open-source  openldap  ops  oracle  oss  ouch  outages  pacelc  paldb  paper  papers  parents  parquet  partition  partitions  pdf  percona  performance  persistence  persistent  peter-bailis  pig  pinterest  pipelinedb  pod  post-mortem  postgres  postgresql  postmortem  postmortems  ppsn  presentation  primary-schools  privacy  production  programming  protocols  pt-query-digest  punch-cards  python  querying  queueing  quorum  raft  rails  ram  ramp  rds  realtime  redis  releasing  reliability  replication  rethinkdb  riak  risks  rocksdb  rollback  s3  sanitisation  scalability  scale  scaling  schema  schemaless  schemas  schools  scraping  scribe  scylla  search  security  serializability  services  set  set-cover  set-storage  sharding  side-data  simd  simon-willison  slides  sliding-window  sort  spares  spof  sql  square  ssd  state  stateful-services  storage  stores  streaming  stress-tests  stripe  strong-typing  sync  synchronization  systems  tcpdump  tee  testing  tests  throughput  time  timelines  transactional  transactions  trees  twitter  uk  upgrades  utf-8  utf8mb4  validation  vector-clocks  via:dave-doran  via:fanf  via:hn  via:peakscale  via:simonw  voldemort  vulnerability  war-stories  warm-spares  white-papers  workloads  zipfian  zookeeper 

Copy this bookmark:



description:


tags: