jm + storage   158

A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications - All Things Distributed
A deep dive on how we were using our existing databases revealed that they were frequently not used for their relational capabilities. About 70 percent of operations were of the key-value kind, where only a primary key was used and a single row would be returned. About 20 percent would return a set of rows, but still operate on only a single table.

With these requirements in mind, and a willingness to question the status quo, a small group of distributed systems experts came together and designed a horizontally scalable distributed database that would scale out for both reads and writes to meet the long-term needs of our business. This was the genesis of the Amazon Dynamo database.

The success of our early results with the Dynamo database encouraged us to write Amazon's Dynamo whitepaper and share it at the 2007 ACM Symposium on Operating Systems Principles (SOSP conference), so that others in the industry could benefit. The Dynamo paper was well-received and served as a catalyst to create the category of distributed database technologies commonly known today as "NoSQL."

That's not an exaggeration. Nice one Werner et al!
dynamo  history  nosql  storage  databases  distcomp  amazon  papers  acm  data-stores 
14 days ago by jm
"Why We Built Our Own Distributed Column Store" (video)
"Why We Built Our Own Distributed Column Store" by Sam Stokes of -- Retriever, inspired by Facebook's Scuba
scuba  retriever  storage  data-stores  columnar-storage  databases  via:charitymajors 
14 days ago by jm
Intel pcj library for persistent memory-oriented data structures
This is a "pilot" project to develop a library for Java objects stored in persistent memory. Persistent collections are being emphasized because many applications for persistent memory seem to map well to the use of collections. One of this project's goals is to make programming with persistent objects feel natural to a Java developer, for example, by using familiar Java constructs when incorporating persistence elements such as data consistency and object lifetime.

The breadth of persistent types is currently limited and the code is not performance-optimized. We are making the code available because we believe it can be useful in experiments to retrofit existing Java code to use persistent memory and to explore persistent Java programming in general.

(via Mario Fusco)
persistent-memory  data-structures  storage  persistence  java  coding  future 
20 days ago by jm
bet365 to save Riak
'It is our intention to open source all of Basho's products and all of
the source code that they have been working on. We'll do this as quickly as
we are able to organise it, and we would appreciate some input from the
community on how you would like this done.'
oss  riak  basho  storage  open-source  bet365 
8 weeks ago by jm
Arq Backs Up To B2!
Arq backup for OSX now supports B2 (as well as S3) as a storage backend.
"it’s a super-cheap option ($.005/GB per month) for storing your backups." (that is less than half the price of $0.0125/GB for S3's Infrequent Access class)
s3  storage  b2  backblaze  backups  arq  macosx  ops 
9 weeks ago by jm
Will the last person at Basho please turn out the lights? • The Register
Basho, once a rising star of the NoSQL database world, has faded away to almost nothing [...] According to sources, the company, which developed the Riak distributed database, has been shedding engineers for months, and is now operating as a shadow of its former self, as at least one buy-out has fallen through.
basho  riak  nosql  databases  storage  startups  funding 
july 2017 by jm
Don't Settle For Eventual Consistency
Quite an argument. Not sure I agree, but worth a bookmark anyway...
With an AP system, you are giving up consistency, and not really gaining anything in terms of effective availability, the type of availability you really care about.  Some might think you can regain strong consistency in an AP system by using strict quorums (where the number of nodes written + number of nodes read > number of replicas).  Cassandra calls this “tunable consistency”.  However, Kleppmann has shown that even with strict quorums, inconsistencies can result.10  So when choosing (algorithmic) availability over consistency, you are giving up consistency for not much in return, as well as gaining complexity in your clients when they have to deal with inconsistencies.
cap-theorem  databases  storage  cap  consistency  cp  ap  eventual-consistency 
june 2017 by jm
_Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases_
'Amazon Aurora is a relational database service for OLTP workloads offered as part of Amazon Web Services (AWS). In this paper, we describe the architecture of Aurora and the design considerations leading to that architecture. We believe the central constraint in high throughput data processing has moved from compute and storage to the network. Aurora brings a novel architecture to the relational database to address this constraint, most notably by pushing redo processing to a multi-tenant scale-out storage service, purpose-built for Aurora. We describe how doing so not only reduces network traffic, but also allows for fast crash recovery, failovers to replicas without loss of data, and fault-tolerant, self-healing storage. We then describe how Aurora achieves consensus on durable state across numerous storage nodes using an efficient asynchronous scheme, avoiding expensive and chatty recovery protocols. Finally, having operated Aurora as a production service for over 18 months, we share the lessons we have learnt from our customers on what modern cloud applications expect from databases.'
via:rbranson  aurora  aws  amazon  databases  storage  papers  architecture 
may 2017 by jm
Developing a time-series "database" based on HdrHistogram
Histogram aggregation is definitely a sensible way to store this kind of data
storage  elasticsearch  metrics  hdrhistogram  histograms  tideways 
april 2017 by jm
Amazon DynamoDB Accelerator (DAX)
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. DAX does all the heavy lifting required to add in-memory acceleration to your DynamoDB tables, without requiring developers to manage cache invalidation, data population, or cluster management.

No latency percentile figures, unfortunately. Also still in preview.
amazon  dynamodb  aws  dax  performance  storage  databases  latency  low-latency 
april 2017 by jm
When Boring is Awesome: Building a scalable time-series database on PostgreSQL
Nice. we built something along these lines atop MySQL before -- partitioning by timestamp is the key. (via Nelson)
database  postgresql  postgres  timeseries  tsd  storage  state  via:nelson 
april 2017 by jm
Deep Dive on Amazon EBS Elastic Volumes
'March 2017 AWS Online Tech Talks' -- lots about the new volume types
aws  ebs  storage  architecture  ops  slides 
march 2017 by jm
[no title]
'For decades, the transaction concept has played a central role in
database research and development. Despite this prominence, transactional
databases today often surface much weaker models than the
classic serializable isolation guarantee—and, by default, far weaker
models than alternative,“strong but not serializable” models such as
Snapshot Isolation. Moreover, the transaction concept requires the
programmer’s involvement: should an application programmer fail
to correctly use transactions by appropriately encapsulating functionality,
even serializable transactions will expose programmers
to errors. While many errors arising from these practices may be
masked by low concurrency during normal operation, they are susceptible
to occur during periods of abnormally high concurrency. By
triggering these errors via concurrent access in a deliberate attack, a
determined adversary could systematically exploit them for gain.
In this work, we defined the problem of ACIDRain attacks and
introduced 2AD, a lightweight dynamic analysis tool that uses traces
of normal database activity to detect possible anomalous behavior
in applications. To enable 2AD, we extended Adya’s theory of weak
isolation to allow efficient reasoning over the space of all possible
concurrent executions of a set of transactions based on a concrete
history, via a new concept called an abstract history, which also
applies to API calls. We then applied 2AD analysis to twelve popular
self-hosted eCommerce applications, finding 22 vulnerabilities
spread across all but one application we tested, affecting over 50%
of eCommerce sites on the Internet today.

We believe that the magnitude and the prevalence of these vulnerabilities
to ACIDRain attacks merits a broader reconsideration of
the success of the transaction concept as employed by programmers
today, in addition to further pursuit of research in this direction.
Based on our early experiences both performing ACIDRain attacks
on self-hosted applications as well as engaging with developers, we
believe there is considerable work to be done in raising awareness
of these attacks—for example, via improved analyses and additional
2AD refinement rules (including analysis of source code to
better highlight sources of error)—and in automated methods for defending
against these attacks—for example, by synthesizing repairs
such as automated isolation level tuning and selective application
of SELECT FOR UPDATE mechanisms. Our results here—as well as
existing instances of ACIDRain attacks in the wild—suggest there
is considerable value at stake.'
databases  transactions  vulnerability  security  acidrain  peter-bailis  storage  isolation  acid 
march 2017 by jm
Manage DynamoDB Items Using Time to Live (TTL)
good call.
Many DynamoDB users store data that has a limited useful life or is accessed less frequently over time. Some of them track recent logins, trial subscriptions, or application metrics. Others store data that is subject to regulatory or contractual limitations on how long it can be stored. Until now, these customers implemented their own time-based data management. At scale, this sometimes meant that they ran a couple of Amazon Elastic Compute Cloud (EC2) instances that did nothing more than scan DynamoDB items, check date attributes, and issue delete requests for items that were no longer needed. This added cost and complexity to their application. In order to streamline this popular and important use case, we are launching a new Time to Live (TTL) feature today. You can enable this feature on a table-by-table basis, specifying an item attribute that contains the expiration time for the item.
dynamodb  ttl  storage  aws  architecture  expiry 
february 2017 by jm
Spotify's read-only k/v store
spotify  sparkey  read-only  key-value  storage  ops  architecture 
february 2017 by jm
Beringei: A high-performance time series storage engine | Engineering Blog | Facebook Code
Beringei is different from other in-memory systems, such as memcache, because it has been optimized for storing time series data used specifically for health and performance monitoring. We designed Beringei to have a very high write rate and a low read latency, while being as efficient as possible in using RAM to store the time series data. In the end, we created a system that can store all the performance and monitoring data generated at Facebook for the most recent 24 hours, allowing for extremely fast exploration and debugging of systems and services as we encounter issues in production.

Data compression was necessary to help reduce storage overhead. We considered several existing compression schemes and rejected the techniques that applied only to integer data, used approximation techniques, or needed to operate on the entire dataset. Beringei uses a lossless streaming compression algorithm to compress points within a time series with no additional compression used across time series. Each data point is a pair of 64-bit values representing the timestamp and value of the counter at that time. Timestamps and values are compressed separately using information about previous values. Timestamp compression uses a delta-of-delta encoding, so regular time series use very little memory to store timestamps.

From analyzing the data stored in our performance monitoring system, we discovered that the value in most time series does not change significantly when compared to its neighboring data points. Further, many data sources only store integers (despite the system supporting floating point values). Knowing this, we were able to tune previous academic work to be easier to compute by comparing the current value with the previous value using XOR, and storing the changed bits. Ultimately, this algorithm resulted in compressing the entire data set by at least 90 percent.
beringei  compression  facebook  monitoring  tsd  time-series  storage  architecture 
february 2017 by jm, MySQL and UTF-8
good preso from Percona Live 2015 on the messiness of MySQL vs UTF-8 and utf8mb4
utf-8  utf8mb4  mysql  storage  databases  slides  character-sets 
december 2016 by jm
MemC3: Compact and concurrent Memcache with dumber caching and smarter hashing
An improved hashing algorithm called optimistic cuckoo hashing, and a CLOCK-based eviction algorithm that works in tandem with it. They are evaluated in the context of Memcached, where combined they give up to a 30% memory usage reduction and up to a 3x improvement in queries per second as compared to the default Memcached implementation on read-heavy workloads with small objects (as is typified by Facebook workloads).
memcached  performance  key-value-stores  storage  databases  cuckoo-hashing  algorithms  concurrency  caching  cache-eviction  memory  throughput 
november 2016 by jm
Charity Majors responds to the CleverTap Mongo outage war story
This is a great blog post, spot on:
You can’t just go “dudes it’s faster” and jump off a cliff.  This shit is basic.  Test real production workloads. Have a rollback plan.  (Not for *10 days* … try a month or two.)

The only thing I'd nitpick on is that it's all very well to say "buy my book" or "come see me talk at Blahcon", but a good blog post or webpage would be thousands of times more useful.
databases  stateful-services  services  ops  mongodb  charity-majors  rollback  state  storage  testing  dba 
october 2016 by jm
A Loud Sound Just Shut Down a Bank's Data Center for 10 Hours | Motherboard
The purpose of the drill was to see how the data center's fire suppression system worked. Data centers typically rely on inert gas to protect the equipment in the event of a fire, as the substance does not chemically damage electronics, and the gas only slightly decreases the temperature within the data center.

The gas is stored in cylinders, and is released at high velocity out of nozzles uniformly spread across the data center. According to people familiar with the system, the pressure at ING Bank's data center was higher than expected, and produced a loud sound when rapidly expelled through tiny holes (think about the noise a steam engine releases). The bank monitored the sound and it was very loud, a source familiar with the system told us. “It was as high as their equipment could monitor, over 130dB”.

Sound means vibration, and this is what damaged the hard drives. The HDD cases started to vibrate, and the vibration was transmitted to the read/write heads, causing them to go off the data tracks. “The inert gas deployment procedure has severely and surprisingly affected several servers and our storage equipment,” ING said in a press release.
ing  hardware  outages  hard-drives  fire  fire-suppression  vibration  data-centers  storage 
september 2016 by jm
Why Uber Engineering Switched from Postgres to MySQL
Uber bringing the smackdown for the HN postgres fanclub, with some juicy technical details of issues that caused them pain. FWIW, I was bitten by crappy postgres behaviour in the past (specifically around vacuuming and pgbouncer), so I've long been a MySQL fan ;)
database  mysql  postgres  postgresql  uber  architecture  storage  sql 
july 2016 by jm
ClickHouse — open-source distributed column-oriented DBMS
'ClickHouse manages extremely large volumes of data in a stable and sustainable manner. It currently powers Yandex.Metrica, world’s second largest web analytics platform, with over 13 trillion database records and over 20 billion events a day, generating customized reports on-the-fly, directly from non-aggregated data. This system was successfully implemented at CERN’s LHCb experiment to store and process metadata on 10bn events with over 1000 attributes per event registered in 2011.'

Yandex-tastic, but still looks really interesting
yandex  analytics  database  storage  sql  clickhouse 
june 2016 by jm
Cross-Region Read Replicas for Amazon Aurora
Creating a read replica in another region also creates an Aurora cluster in the region. This cluster can contain up to 15 more read replicas, with very low replication lag (typically less than 20 ms) within the region (between regions, latency will vary based on the distance between the source and target). You can use this model to duplicate your cluster and read replica setup across regions for disaster recovery. In the event of a regional disruption, you can promote the cross-region replica to be the master. This will allow you to minimize downtime for your cross-region application. This feature applies to unencrypted Aurora clusters.
aws  mysql  databases  storage  replication  cross-region  failover  reliability  aurora 
june 2016 by jm
Collecting my thoughts about Torus
Worryingly-optimistic communications about CoreOS' recently-announced distributed storage system. I had similar thoughts, but Jeff Darcy is actually an expert on this stuff so he's way more worth listening to on the topic ;)
jeff-darcy  distcomp  filesystems  coreos  torus  storage 
june 2016 by jm
_DataEngConf: Parquet at Datadog_
"How we use Parquet for tons of metrics data". good preso from Datadog on their S3/Parquet setup
datadog  parquet  storage  s3  databases  hadoop  map-reduce  big-data 
may 2016 by jm
BTrDB: Optimizing Storage System Design for Timeseries Processing
interesting, although they punt to Ceph for storage and miss out the chance to make a CRDT
storage  trees  data-structures  timeseries  delta-delta-coding  encoding  deltas 
may 2016 by jm
git for Cloud Storage. Create distributed, decentralized and versioned repositories that scale infinitely to 100s of millions of files and PBs of storage. Huge repos can be cloned on your local SSD for making changes, committing and pushing back. Oh yeah, and it dedupes too due to BLAKE2 Tree hashing.
git  ops  storage  cloud  s3  disk  aws  version-control  blake2 
april 2016 by jm
Counting with domain specific databases — The Smyte Blog — Medium
whoa, pretty heavily engineered scalable counting system with Kafka, RocksDB and Kubernetes
kafka  rocksdb  kubernetes  counting  databases  storage  ops 
april 2016 by jm
Jepsen: RethinkDB 2.1.5
A good review of RethinkDB! Hopefully not just because this test is contract work on behalf of the RethinkDB team ;)
I’ve run hundreds of test against RethinkDB at majority/majority, at various timescales, request rates, concurrencies, and with different types of failures. Consistent with the documentation, I have never found a linearization failure with these settings. If you use hard durability, majority writes, and majority reads, single-document ops in RethinkDB appear safe.
rethinkdb  databases  stores  storage  ops  availability  cap  jepsen  tests  replication 
january 2016 by jm
Open-sourcing PalDB, a lightweight companion for storing side data
a new LinkedIn open source data store, for write-once/read-mainly side data, java, Apache licensed.

RocksDB discussion:
linkedin  open-source  storage  side-data  data  config  paldb  java  apache  databases 
october 2015 by jm
Cluster benchmark: Scylla vs Cassandra
ScyllaDB (the C* clone in C++) is now actually looking promising -- still need more reassurance about its consistency/reliabilty side though
scylla  databases  storage  cassandra  nosql 
october 2015 by jm
The New InfluxDB Storage Engine: A Time Structured Merge Tree
The new engine has similarities with LSM Trees (like LevelDB and Cassandra’s underlying storage). It has a write ahead log, index files that are read only, and it occasionally performs compactions to combine index files. We’re calling it a Time Structured Merge Tree because the index files keep contiguous blocks of time and the compactions merge those blocks into larger blocks of time. Compression of the data improves as the index files are compacted. Once a shard becomes cold for writes it will be compacted into as few files as possible, which yield the best compression.
influxdb  storage  lsm-trees  leveldb  tsm-trees  data-structures  algorithms  time-series  tsd  compression 
october 2015 by jm
Introduction to HDFS Erasure Coding in Apache Hadoop
How Hadoop did EC. Erasure Coding support ("HDFS-EC") is set to be released in Hadoop 3.0 apparently
erasure-coding  reed-solomon  algorithms  hadoop  hdfs  cloudera  raid  storage 
september 2015 by jm
a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL effectively provides a hard disk of dynamic, infinite capacity that can be accessed from any computer with internet access running Linux, FreeBSD or OS-X.
S3QL is a standard conforming, full featured UNIX file system that is conceptually indistinguishable from any local file system. Furthermore, S3QL has additional features like compression, encryption, data de-duplication, immutable trees and snapshotting which make it especially suitable for online backup and archival.
S3QL is designed to favor simplicity and elegance over performance and feature-creep. Care has been taken to make the source code as readable and serviceable as possible. Solid error detection and error handling have been included from the very first line, and S3QL comes with extensive automated test cases for all its components.
filesystems  aws  s3  storage  unix  google-storage  openstack 
september 2015 by jm
Scaling Analytics at Amplitude
Good blog post on Amplitude's lambda architecture setup, based on S3 and a custom "real-time set database" they wrote themselves.

antirez' comment from a Redis angle on the set database:

HN thread:
lambda-architecture  analytics  via:hn  redis  set-storage  storage  databases  architecture  s3  realtime 
august 2015 by jm
What does it take to make Google work at scale? [slides]
50-slide summary of Google's stack, compared vs Facebook, Yahoo!, and open-source-land, with the odd interesting architectural insight
google  architecture  slides  scalability  bigtable  spanner  facebook  gfs  storage 
august 2015 by jm
Amazon S3 Introduces New Usability Enhancements
bucket limit increase, and read-after-write consistency in US Standard. About time too! ;)
aws  s3  storage  consistency 
august 2015 by jm
Google Photos - Can I get out?
what's the export policy for Google's new Photos service? pretty good, it turns out
google  export  data  google-photos  photos  archive  history  storage 
june 2015 by jm
Elements of Scale: Composing and Scaling Data Platforms
Great, encyclopedic blog post rounding up common architectural and algorithmic patterns using in scalable data platforms. Cut out and keep!
architecture  storage  databases  data  big-data  scaling  scalability  ben-stopford  cqrs  druid  parquet  columnar-stores  lambda-architecture 
may 2015 by jm
Please stop calling databases CP or AP
In his excellent blog post [...] Jeff Hodges recommends that you use the CAP theorem to critique systems. A lot of people have taken that advice to heart, describing their systems as “CP” (consistent but not available under network partitions), “AP” (available but not consistent under network partitions), or sometimes “CA” (meaning “I still haven’t read Coda’s post from almost 5 years ago”).

I agree with all of Jeff’s other points, but with regard to the CAP theorem, I must disagree. The CAP theorem is too simplistic and too widely misunderstood to be of much use for characterizing systems. Therefore I ask that we retire all references to the CAP theorem, stop talking about the CAP theorem, and put the poor thing to rest. Instead, we should use more precise terminology to reason about our trade-offs.
cap  databases  storage  distcomp  ca  ap  cp  zookeeper  consistency  reliability  networking 
may 2015 by jm
Intel speeds up etcd throughput using ADR Xeon-only hardware feature
To reduce the latency impact of storing to disk, Weaver’s team looked to buffering as a means to absorb the writes and sync them to disk periodically, rather than for each entry. Tradeoffs? They knew memory buffers would help, but there would be potential difficulties with smaller clusters if they violated the stable storage requirement.

Instead, they turned to Intel’s silicon architects about features available in the Xeon line. After describing the core problem, they found out this had been solved in other areas with ADR. After some work to prove out a Linux OS supported use for this, they were confident they had a best-of-both-worlds angle. And it worked. As Weaver detailed in his CoreOS Fest discussion, the response time proved stable. ADR can grab a section of memory, persist it to disk and power it back. It can return entries back to disk and restore back to the buffer. ADR provides the ability to make small (<100MB) segments of memory “stable” enough for Raft log entries. It means it does not need battery-backed memory. It can be orchestrated using Linux or Windows OS libraries. ADR allows the capability to define target memory and determine where to recover. It can also be exposed directly into libs for runtimes like Golang. And it uses silicon features that are accessible on current Intel servers.
kubernetes  coreos  adr  performance  intel  raft  etcd  hardware  linux  persistence  disk  storage  xeon 
may 2015 by jm
Call me maybe: Aerospike
'Aerospike offers phenomenal latencies and throughput -- but in terms of data safety, its strongest guarantees are similar to Cassandra or Riak in Last-Write-Wins mode. It may be a safe store for immutable data, but updates to a record can be silently discarded in the event of network disruption. Because Aerospike’s timeouts are so aggressive–on the order of milliseconds -- even small network hiccups are sufficient to trigger data loss. If you are an Aerospike user, you should not expect “immediate”, “read-committed”, or “ACID consistency”; their marketing material quietly assumes you have a magical network, and I assure you this is not the case. It’s certainly not true in cloud environments, and even well-managed physical datacenters can experience horrible network failures.'
aerospike  outages  cap  testing  jepsen  aphyr  databases  storage  reliability 
may 2015 by jm
Call me maybe: Elasticsearch 1.5.0
tl;dr: Elasticsearch still hoses data integrity on partition, badly
elasticsearch  reliability  data  storage  safety  jepsen  testing  aphyr  partition  network-partitions  cap 
may 2015 by jm
HashiCorp's take on the secrets-storage system. looks good
hashicorp  deployment  security  secrets  authentication  vault  storage  keys  key-rotation 
april 2015 by jm
Making Pinterest — Learn to stop using shiny new things and love MySQL
'The third reason people go for shiny is because older tech isn’t advertised as aggressively as newer tech. The younger companies needs to differentiate from the old guard and be bolder, more passionate and promise to fulfill your wildest dreams. But most new tech sales pitches aren’t generally forthright about their many failure modes. In our early days, we fell into this third trap. We had a lot of growing pains as we scaled the architecture. The most vocal and excited database companies kept coming to us saying they’d solve all of our scalability problems. But nobody told us of the virtues of MySQL, probably because MySQL just works, and people know about it.'

It's true! -- I'm still a happy MySQL user for some use cases, particularly read-mostly relational configuration data...
mysql  storage  databases  reliability  pinterest  architecture 
april 2015 by jm
Bigcommerce Status Page blasts IBM Softlayer Object Storage service
This is pretty heavy stuff:
Bigcommerce engineers have been very pro-active in working with our storage provider, IBM Softlayer, in finding solutions. Unfortunately, it takes two parties to come to a solution. In this case, IBM Softlayer intentionally let their Object Storage cluster fall into disrepair and chose not to scale it. This has impacted Bigcommerce, IBM and many other Softlayer customers. Our engineers placed too much trust in IBM Softlayer and that's on us. However, the catastrophic failures to see metrics and rapidly scale capacity, the decisions to let hard drives sit at 90% utilization for weeks and months, the cascading failures of an undersized cluster of 52 nodes for the busiest data center in their business speaks to IBM Softlayer’s lack of concern for their customers. We found this out 3 days ago.

(via Oisin)
softlayer  bigcommerce  outages  shambles  ibm  fail  object-storage  storage  iaas  cloud 
april 2015 by jm
How We Scale VividCortex's Backend Systems - High Scalability
Excellent post from Baron Schwartz about their large-scale, 1-second-granularity time series database storage system
time-series  tsd  storage  mysql  sql  baron-schwartz  ops  performance  scalability  scaling  go 
march 2015 by jm
Kafka best practices
This is the second part of our guide on streaming data and Apache Kafka. In part one I talked about the uses for real-time data streams and explained our idea of a stream data platform. The remainder of this guide will contain specific advice on how to go about building a stream data platform in your organization.

tl;dr: limit the number of Kafka clusters; use Avro.
architecture  kafka  storage  streaming  event-processing  avro  schema  confluent  best-practices  tips 
march 2015 by jm
Backblaze Vaults: Zettabyte-Scale Cloud Storage Architecture
Backblaze deliver their take on nearline storage:

'Backblaze’s cloud storage Vaults deliver 99.99999% annual durability, horizontal scalability, and 20 Gbps of per-Vault performance, while being operationally efficient and extremely cost effective. Driven from the same mindset that we brought to the storage market with Backblaze Storage Pods, Backblaze Vaults continue our singular focus of building the most cost-efficient cloud storage around.'
architecture  backup  storage  backblaze  nearline  offline  reed-solomon  error-correction 
march 2015 by jm
Goodbye MongoDB, Hello PostgreSQL
Another core problem we’ve faced is one of the fundamental features of MongoDB (or any other schemaless storage engine): the lack of a schema. The lack of a schema may sound interesting, and in some cases it can certainly have its benefits. However, for many the usage of a schemaless storage engine leads to the problem of implicit schemas. These schemas aren’t defined by your storage engine but instead are defined based on application behaviour and expectations.

Well, don't say we didn't warn you ;)
mongodb  mysql  postgresql  databases  storage  schemas  war-stories 
march 2015 by jm
Pinterest's highly-available configuration service
Stored on S3, update notifications pushed to clients via Zookeeper
s3  zookeeper  ha  pinterest  config  storage 
march 2015 by jm
Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage, and Network-bandwidth | USENIX
Erasure codes, such as Reed-Solomon (RS) codes, are increasingly being deployed as an alternative to data-replication for fault tolerance in distributed storage systems. While RS codes provide significant savings in storage space, they can impose a huge burden on the I/O and network resources when reconstructing failed or otherwise unavailable data. A recent class of erasure codes, called minimum-storage-regeneration (MSR) codes, has emerged as a superior alternative to the popular RS codes, in that it minimizes network transfers during reconstruction while also being optimal with respect to storage and reliability. However, existing practical MSR codes do not address the increasingly important problem of I/O overhead incurred during reconstructions, and are, in general, inferior to RS codes in this regard. In this paper, we design erasure codes that are simultaneously optimal in terms of I/O, storage, and network bandwidth. Our design builds on top of a class of powerful practical codes, called the product-matrix-MSR codes. Evaluations show that our proposed design results in a significant reduction the number of I/Os consumed during reconstructions (a 5 reduction for typical parameters), while retaining optimality with respect to storage, reliability, and network bandwidth.
erasure-coding  reed-solomon  compression  reliability  reconstruction  replication  fault-tolerance  storage  bandwidth  usenix  papers 
february 2015 by jm
0x74696d | Falling In And Out Of Love with DynamoDB, Part II
Good DynamoDB real-world experience post, via Mitch Garnaat. We should write up ours, although it's pretty scary-stuff-free by comparison
aws  dynamodb  storage  databases  architecture  ops 
february 2015 by jm
Personalization at Spotify using Cassandra
Lots and lots of good detail into the Spotify C* setup (via Bill de hOra)
via:dehora  spotify  cassandra  replication  storage  ops 
january 2015 by jm
Good advice on running large-scale database stress tests
I've been bitten by poor key distribution in tests in the past, so this is spot on: 'I'd run it with Zipfian, Pareto, and Dirac delta distributions, and I'd choose read-modify-write transactions.'

And of course, a dataset bigger than all combined RAM.

Also: -- the "Biebermark", where just a single row out of the entire db is contended on in a read/modify/write transaction: "the inspiration for this is maintaining counts for [highly contended] popular entities like Justin Bieber and One Direction."
biebermark  benchmarks  testing  performance  stress-tests  databases  storage  mongodb  innodb  foundationdb  aphyr  measurement  distributions  keys  zipfian 
december 2014 by jm
If Eventual Consistency Seems Hard, Wait Till You Try MVCC
ex-Percona MySQL wizard Baron Schwartz, noting that MVCC as implemented in common SQL databases is not all that simple or reliable compared to big bad NoSQL Eventual Consistency:
Since I am not ready to assert that there’s a distributed system I know to be better and simpler than eventually consistent datastores, and since I certainly know that InnoDB’s MVCC implementation is full of complexities, for right now I am probably in the same position most of my readers are: the two viable choices seem to be single-node MVCC and multi-node eventual consistency. And I don’t think MVCC is the simpler paradigm of the two.
nosql  concurrency  databases  mysql  riak  voldemort  eventual-consistency  reliability  storage  baron-schwartz  mvcc  innodb  postgresql 
december 2014 by jm
"Macaroons" for fine-grained secure database access
Macaroons are an excellent fit for NoSQL data storage for several reasons. First, they enable an application developer to enforce security policies at very fine granularity, per object. Gone are the clunky security policies based on the IP address of the client, or the per-table access controls of RDBMSs that force you to split up your data across many tables. Second, macaroons ensure that a client compromise does not lead to loss of the entire database. Third, macaroons are very flexible and expressive, able to incorporate information from external systems and third-party databases into authorization decisions. Finally, macaroons scale well and are incredibly efficient, because they avoid public-key cryptography and instead rely solely on fast hash functions.
security  macaroons  cookies  databases  nosql  case-studies  storage  authorization  hyperdex 
november 2014 by jm
Smart Clients, haproxy, and Riak
Good, thought-provoking post on good client library approaches for complex client-server systems, particularly distributed stores like Voldemort or Riak. I'm of the opinion that a smart client lib is unavoidable, and in fact essential, since the clients are part of the distributed system, personally.
clients  libraries  riak  voldemort  distsys  haproxy  client-server  storage 
october 2014 by jm
a Riak-based clone of Roshi, the CRDT server built on top of Redis. some day I'll write up the CRDT we use on top of Voldemort in $work.

riak  roshi  crdt  redis  storage  time-series-data 
october 2014 by jm Pulse
Syncthing is becoming Pulse. Pulse replaces proprietary sync and cloud services with something open, trustworthy and decentralised. Your data is your data alone and you deserve to choose where it is stored, if it is shared with some third party, and how it's transmitted over the Internet.
syncing  storage  cloud  dropbox  utilities  gpl  decentralization 
october 2014 by jm
Mnesia and CAP
A common “trick” is to claim:

'We assume network partitions can’t happen. Therefore, our system is CA according to the CAP theorem.'

This is a nice little twist. By asserting network partitions cannot happen, you just made your system into one which is not distributed. Hence the CAP theorem doesn’t even apply to your case and anything can happen. Your system may be linearizable. Your system might have good availability. But the CAP theorem doesn’t apply. [...]
In fact, any well-behaved system will be “CA” as long as there are no partitions. This makes the statement of a system being “CA” very weak, because it doesn’t put honesty first. I tries to avoid the hard question, which is how the system operates under failure. By assuming no network partitions, you assume perfect information knowledge in a distributed system. This isn’t the physical reality.
cap  erlang  mnesia  databases  storage  distcomp  reliability  ca  postgres  partitions 
october 2014 by jm
mcrouter: A memcached protocol router for scaling memcached deployments
New from Facebook engineering:
Last year, at the Data@Scale event and at the USENIX Networked Systems Design and Implementation conference , we spoke about turning caches into distributed systems using software we developed called mcrouter (pronounced “mick-router”). Mcrouter is a memcached protocol router that is used at Facebook to handle all traffic to, from, and between thousands of cache servers across dozens of clusters distributed in our data centers around the world. It is proven at massive scale — at peak, mcrouter handles close to 5 billion requests per second. Mcrouter was also proven to work as a standalone binary in an Amazon Web Services setup when Instagram used it last year before fully transitioning to Facebook's infrastructure.

Today, we are excited to announce that we are releasing mcrouter’s code under an open-source BSD license. We believe it will help many sites scale more easily by leveraging Facebook’s knowledge about large-scale systems in an easy-to-understand and easy-to-deploy package.

This is pretty crazy -- basically turns a memcached cluster into a much more usable clustered-storage system, with features like shadowing production traffic, cold cache warmup, online reconfiguration, automatic failover, prefix-based routing, replicated pools, etc. Lots of good features.
facebook  scaling  cache  proxy  memcache  open-source  clustering  distcomp  storage 
september 2014 by jm
Aerospike's CA boast gets a thumbs-down from @aphyr
Specifically, @aerospikedb cannot offer cursor stability, repeatable read, snapshot isolation, or any flavor of serializability.
@nasav @aerospikedb At *best* you can offer Read Committed, which is not, I assert, what most people would expect from an "ACID" database.
aphyr  aerospike  availability  consistency  acid  transactions  distcomp  databases  storage 
september 2014 by jm
The Myth of Schema-less [NoSQL]
We don't seem to gain much in terms of database flexibility. Is our application more flexible? I don't think so. Even without our schema explicitly defined in our database, it's there... somewhere. You simply have to search through hundreds of thousands of lines to find all the little bits of it. It has the potential to be in several places, making it harder to properly identify. The reality of these codebases is that they are error prone and rarely lack the necessary documentation. This problem is magnified when there are multiple codebases talking to the same database. This is not an uncommon practice for reporting or analytical purposes.

Finally, all this "flexibility" rears its head in the same way that PHP and Javascript's "neat" weak typing stabs you right in the face. There are some somethings you can be cavalier about, and some things you should be strict about. Your data model is one you absolutely need to be strict on. If a field should store an int, it should store nothing else. Not a string, not a picture of a horse, but an integer. It's nice to know that I have my database doing type checking for me and I can expect a field to be the same type across all records.

All this leads us to an undeniable fact: There is always a schema. Wearing "I don't do schema" as a badge of honor is a complete joke and encourages a terrible development practice.
nosql  databases  storage  schema  strong-typing 
july 2014 by jm
Chef Vault
A way to securely store secrets (auth details, API keys, etc.) in Chef
chef  storage  knife  authorisation  api-keys  security  encryption 
june 2014 by jm
Benchmarking LevelDB vs. RocksDB vs. HyperLevelDB vs. LMDB Performance for InfluxDB
A few interesting things come out of these results. LevelDB is the winner on disk space utilization, RocksDB is the winner on reads and deletes, and HyperLevelDB is the winner on writes. On smaller runs (30M or less), LMDB came out on top on most of the metrics except for disk size. This is actually what we’d expect for B-trees: they’re faster the fewer keys you have in them.

Mind you, I'd prefer if this had tunable read/write/delete ratios, as YCSB does. Take with a pinch of salt, as with all benchmarks!
benchmarks  leveldb  datastores  storage  hyperleveldb  rocksdb  ycsb  lmdb  influxdb 
june 2014 by jm
Call me maybe: Elasticsearch
Wow, these are terrible results. From the sounds of it, ES just cannot deal with realistic outage scenarios and is liable to suffer catastrophic damage in reasonably-common partitions.
If you are an Elasticsearch user (as I am): good luck. Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present. If you can, store your data in a safer database, and feed it into Elasticsearch gradually. Have processes in place that continually traverse the system of record, so you can recover from ES data loss automatically.
elasticsearch  ops  storage  databases  jepsen  partition  network  outages  reliability 
june 2014 by jm
Cap'n Proto, FlatBuffers, and SBE
a feature comparison of these new serialization formats from Kenton, the capnp dude
serialization  protobuf  capnproto  sbe  flatbuffers  google  coding  storage 
june 2014 by jm
FlatBuffers: Main Page
A new serialization format from Google's Android gaming team, supporting C++ and Java, open source under the ASL v2. Reasons to use it:
Access to serialized data without parsing/unpacking - What sets FlatBuffers apart is that it represents hierarchical data in a flat binary buffer in such a way that it can still be accessed directly without parsing/unpacking, while also still supporting data structure evolution (forwards/backwards compatibility).
Memory efficiency and speed - The only memory needed to access your data is that of the buffer. It requires 0 additional allocations. FlatBuffers is also very suitable for use with mmap (or streaming), requiring only part of the buffer to be in memory. Access is close to the speed of raw struct access with only one extra indirection (a kind of vtable) to allow for format evolution and optional fields. It is aimed at projects where spending time and space (many memory allocations) to be able to access or construct serialized data is undesirable, such as in games or any other performance sensitive applications. See the benchmarks for details.
Flexible - Optional fields means not only do you get great forwards and backwards compatibility (increasingly important for long-lived games: don't have to update all data with each new version!). It also means you have a lot of choice in what data you write and what data you don't, and how you design data structures.
Tiny code footprint - Small amounts of generated code, and just a single small header as the minimum dependency, which is very easy to integrate. Again, see the benchmark section for details.
Strongly typed - Errors happen at compile time rather than manually having to write repetitive and error prone run-time checks. Useful code can be generated for you.
Convenient to use - Generated C++ code allows for terse access & construction code. Then there's optional functionality for parsing schemas and JSON-like text representations at runtime efficiently if needed (faster and more memory efficient than other JSON parsers).

Looks nice, but it misses the language coverage of protobuf. Definitely more practical than capnproto.
c++  google  java  serialization  json  formats  protobuf  capnproto  storage  flatbuffers 
june 2014 by jm
SSD shadiness: Kingston and PNY caught bait-and-switching cheaper components after good reviews | ExtremeTech
Imagine buying a high-end Core i7 or AMD CPU, opening the box, and finding a midrange part sitting there with an asterisk and the label “Performs Just Like Our High End CPU In Single-Threaded SuperPi!”
ssd  storage  hardware  sketchy  kingston  pny  bait-and-switch  components  vendors  via:hn 
june 2014 by jm
a distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. Cockroach nodes are symmetric; a design goal is one binary with minimal configuration and no required auxiliary services.

Cockroach implements a single, monolithic sorted map from key to value where both keys and values are byte strings (not unicode). Cockroach scales linearly (theoretically up to 4 exabytes (4E) of logical data). The map is composed of one or more ranges and each range is backed by data stored in RocksDB (a variant of LevelDB), and is replicated to a total of three or more cockroach servers. Ranges are defined by start and end keys. Ranges are merged and split to maintain total byte size within a globally configurable min/max size interval. Range sizes default to target 64M in order to facilitate quick splits and merges and to distribute load at hotspots within a key range. Range replicas are intended to be located in disparate datacenters for survivability (e.g. { US-East, US-West, Japan }, { Ireland, US-East, US-West}, { Ireland, US-East, US-West, Japan, Australia }).

Single mutations to ranges are mediated via an instance of a distributed consensus algorithm to ensure consistency. We’ve chosen to use the Raft consensus algorithm. All consensus state is stored in RocksDB.

A single logical mutation may affect multiple key/value pairs. Logical mutations have ACID transactional semantics. If all keys affected by a logical mutation fall within the same range, atomicity and consistency are guaranteed by Raft; this is the fast commit path. Otherwise, a non-locking distributed commit protocol is employed between affected ranges.

Cockroach provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes--both from an historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation; clients must consciously decide to trade correctness for performance. Cockroach implements a limited form of linearalizability, providing ordering for any observer or chain of observers.

This looks nifty. One to watch.
cockroachdb  databases  storage  georeplication  raft  consensus  acid  go  key-value-stores  rocksdb 
may 2014 by jm
« earlier      
per page:    204080120160

related tags

2i  acid  acidrain  acm  adr  aerospike  akka  alert-logic  algorithms  amazon  analog  analytics  ap  apache  aphyr  api-keys  app-engine  apyhr  architecture  archive  arq  arrays  asl2  atomic  aurora  authentication  authorisation  authorization  autoremediation  availability  avro  aws  b+trees  b-trees  b2  backblaze  backup  backups  bait-and-switch  bandwidth  baron-schwartz  basho  bdb  bdb-je  beans  belgium  ben-stopford  benchmarks  beringei  berkeley-db  best-practices  bet365  biebermark  big-data  big-o  bigcommerce  bigtable  billing  bit-errors  bittorrent  blake2  bloom-filter  blu-ray  boundary  btrfs  bump  bw-trees  c++  ca  cache  cache-eviction  cache-friendly  caching  calm  cap  cap-theorem  capnproto  carbon  case-studies  cassandra  cbf  cdt  cep  chaos-monkey  character-sets  charity-majors  chef  cleaner  clickhouse  client-server  clients  clojure  cloud  cloud-storage  cloudera  clustering  cockroachdb  coding  coffee  column-oriented  columnar-storage  columnar-stores  columns  comcast  commutativity  components  compression  concurrency  config  confluent  consensus  consistency  converter  cookies  copysets  coreos  corruption  couch  couchdb  counters  counting  cp  cqrs  crdt  crdts  cross-region  cuckoo-hashing  d-left-hashing  data  data-centers  data-loss  data-store  data-stores  data-structures  database  databases  datadog  datastax  datastores  datomic  dax  dba  debug  decentralization  delta-delta-coding  deltas  deployment  disk  disks  distcomp  distributed  distributed-systems  distributions  distsys  documents  dpdk  dremel  drivers  dropbox  druid  durability  dvd  dynamo  dynamodb  ebs  ec2  elasticache  elasticsearch  encoding  encryption  erasure-coding  erlang  error-correction  etcd  event-processing  eventual-consistency  evernote  expiry  export  facebook  fail  failover  failure  failures  false-positives  fault-tolerance  files  filesharing  filesystems  fileupload  fire  fire-suppression  flash  flatbuffers  food  formats  foundationdb  freezing  fs  fsync  funding  future  gae  gc  georeplication  gfs  gilt  gilt-groupe  git  github  gizzard  go  google  google-maps  google-photos  google-storage  gpl  grape  graphite  graphs  ha  hadoop  haproxy  hard-drives  hardware  hash-tables  hashicorp  hashing  hbase  hdds  hdfs  hdrhistogram  histograms  history  hive  home  horizontal-scaling  hrd  hydro  hyperdex  hyperleveldb  iaas  ibm  idempotency  images  in-memory  indexes  influxdb  ing  inmobi  innodb  instagram  intel  inter-region  iops  islands  isolation  java  jay-kreps  jeff-darcy  jepsen  json  jvm  kafka  kellabyte  key-rotation  key-value  key-value-stores  keys  kingston  knife  kobayashi  kubernetes  lambda-architecture  latency  leveldb  libraries  licensing  lifespan  linkedin  linux  lmdb  lock-free  log  low-latency  lsm  lsm-trees  macaroons  macosx  manmade  map-reduce  mapping  maps  marc-brooker  mariadb  marshalling  measurement  mechanical-sympathy  media  memcache  memcached  memory  metrics  mica  microsoft  migrations  mnesia  mongodb  monitoring  mtbf  multicore  mvcc  mysql  nas  nbta  nearline  nelson  netflix  network  network-partitions  networking  nosql  object-storage  off-heap  offline  olap  online-backup  open-source  openldap  openstack  ops  oracle  os  oss  outages  pacelc  paldb  paper  papers  parquet  partition  partitions  patches  paxos  pdf  percona  performance  persistence  persistent  persistent-memory  peter-bailis  photos  pinterest  piops  pny  postgres  postgresql  postmortems  power  presentations  protobuf  proxy  puns  python  queue  queueing  rackspace  raft  raid  ram  range-queries  read-only  realtime  reconstruction  record-shredding  recovery  redis  redundancy  reed-solomon  reference  reliability  remediation  repair  replicas  replication  research  rethinkdb  retriever  revision-control  riak  riak-cs  ricon  rocksdb  rollback  roshi  s3  safety  sbe  scala  scalability  scaling  schema  schemaless  schemas  scuba  scylla  sea  seagate  secrets  security  semilattice  sergio-bossa  serialization  service-metrics  services  set-storage  shadow-stack  shambles  sharding  sharing  shorn-writes  side-data  silos  simon-willison  simpledb  sirius  sketchy  slides  smp  snapstream  softlayer  source-control  space  spanner  spanning  sparkey  spof  spotify  sql  sql-server  ssd  ssds  startups  state  stateful-services  storage  stores  streaming  stress-tests  strong-typing  structs  submodules  sync  synchronization  syncing  sysadmin  systems  television  tellybug  testimonials  testing  tests  throughput  tideways  tiered-storage  tiling  time-series  time-series-data  timeseries  tips  tonx  toread  torus  transactions  transactor  transcoding  trees  tsd  tsm-trees  ttl  tuning  tv  twilio  twitter  uber  unix  usenix  user-stories  utf-8  utf8mb4  utilities  vault  vendors  version-control  via:charitymajors  via:daev  via:dehora  via:donncha  via:fanf  via:filippo  via:highscalability  via:hn  via:martin-thompson  via:nelson  via:peakscale  via:rbranson  via:simonw  vibration  video  voldemort  vulnerability  war-stories  whisper  wind-power  xeon  yandex  ycsb  zfs  zipfian  zookeeper 

Copy this bookmark: