Introduction to HDFS Erasure Coding in Apache Hadoop
How Hadoop implemented erasure coding. EC support ("HDFS-EC") is slated for release in Hadoop 3.0, apparently
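The headline win of erasure coding over plain replication is storage overhead. A back-of-envelope comparison (my arithmetic, assuming the Reed-Solomon 6-data/3-parity layout that HDFS-EC defaults to):

```python
# Storage overhead: 3x replication vs. Reed-Solomon with 6 data + 3 parity
# blocks (RS(6,3), the assumed HDFS-EC default policy).
data_blocks = 6
parity_blocks = 3
replication_factor = 3

replication_overhead = (replication_factor - 1) * 100  # extra % stored
ec_overhead = parity_blocks / data_blocks * 100        # extra % stored
print(f"replication: {replication_overhead}% overhead")  # 200%
print(f"RS(6,3):     {ec_overhead:.0f}% overhead")       # 50%
```

Same single-block-loss tolerance or better, at a quarter of the extra storage.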
erasure-coding  reed-solomon  algorithms  hadoop  hdfs  cloudera  raid  storage 
september 2015 by jm
Latest EBS tuning tips
from yesterday's AWS Summit in NYC:

Cheat sheet of EBS-optimized instances.
Optimize your queue depth to achieve lower latency and higher IOPS.
When configuring your RAID, use a stripe size of 128KB or 256KB.
Use larger block size to speed up the pre-warming process.
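The queue-depth tip follows from Little's law: sustained IOPS is roughly outstanding I/Os divided by per-request latency. A rough sketch (the numbers are illustrative, not from the talk):

```python
# Little's law for a block device: IOPS ~= queue_depth / latency.
# Both figures below are assumed, for illustration only.
queue_depth = 32    # outstanding I/Os kept in flight
latency_s = 0.001   # 1 ms average per-request latency
iops = queue_depth / latency_s
print(iops)  # 32000.0
```

Deeper queues buy throughput, but past the device's limit each extra outstanding request only adds latency, hence "optimize" rather than "maximize".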
ebs  aws  amazon  iops  raid  ops  tuning 
july 2014 by jm
Migrating from MongoDB to Cassandra
Interesting side-effect of using LUKS for full-disk encryption: 'For every disk read, we were pulling in 3MB of data (RA is sectors, SSZ is sector size, 6144*512=3145728 bytes) into cache. Oops. Not only were we doing tons of extra work, but we were trashing our page cache too. The default for the device-mapper used by LUKS under Ubuntu 12.04LTS is incredibly sub-optimal for database usage, especially our usage of Cassandra (more small random reads vs. large rows). We turned this down to 128 sectors — 64KB.'
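The arithmetic in that quote, spelled out (readahead is counted in 512-byte sectors):

```python
SECTOR_BYTES = 512
# Default device-mapper readahead they found: 6144 sectors
print(6144 * SECTOR_BYTES)  # 3145728 bytes = 3 MB pulled in per read
# After tuning down to 128 sectors:
print(128 * SECTOR_BYTES)   # 65536 bytes = 64 KB
```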
cassandra  luks  raid  linux  tuning  ops  blockdev  disks  ssd 
february 2014 by jm
pt-summary
from the Percona Toolkit. 'Conveniently summarizes the status and configuration of a server. It is not a tuning tool or diagnosis tool. It produces a report that is easy to diff and can be pasted into emails without losing the formatting. This tool works well on many types of Unix systems.' --- summarises OOM history, top, netstat connection table, interface stats, network config, RAID, LVM, disks, inodes, disk scheduling, mounts, memory, processors, and CPU.
percona  tools  cli  unix  ops  linux  diagnosis  raid  netstat  oom 
october 2013 by jm
how RAID fits in with Riak
Write heavy, high performance applications should probably use RAID 0 or avoid RAID altogether and consider using a larger n_val and cluster size. Read heavy applications have more options, and generally demand more fault tolerance with the added benefit of easier hardware replacement procedures.

Good to see official guidance on this (via Bill de hOra)
via:dehora  riak  cluster  fault-tolerance  raid  ops 
june 2013 by jm
KDE's brush with git repository corruption: post-mortem
a barely-averted disaster... phew.

while we planned for the case of the server losing a disk or entirely biting the dust, or the total loss of the VM’s filesystem, we didn’t plan for the case of filesystem corruption, and the way the corruption affected our mirroring system triggered some very unforeseen and pathological conditions. [...] the corruption was perfectly mirrored... or rather, due to its nature, imperfectly mirrored. And all data on the anongit [mirrors] was lost.

One risk demonstrated: by trusting in mirroring, rather than a schedule of snapshot backups covering a wide time range, they nearly had a major outage. Silent data corruption, and code bugs, happen -- backups protect against this, but RAID, replication, and mirrors do not.

Another risk: they didn't have a rate limit on project-deletion, which resulted in the "anongit" mirrors deleting their (safe) data copies in response to the upstream corruption. Rate limiting to sanity-check automated changes is vital. What they should have had in place was described by the fix: 'If a new projects file is generated and is more than 1% different than the previous file, the previous file is kept intact (at 1500 repositories, that means 15 repositories would have to be created or deleted in the span of three minutes, which is extremely unlikely).'
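That fix can be sketched roughly as follows (a hypothetical reconstruction, not KDE's actual code; the names are mine):

```python
def should_accept(previous: set[str], new: set[str],
                  threshold: float = 0.01) -> bool:
    """Accept a newly generated project list only if the churn
    (creations + deletions) relative to the previous list stays
    within the threshold; otherwise keep the previous file intact."""
    if not previous:
        return True  # first run: nothing to compare against
    churn = len(previous ^ new)  # symmetric difference = created + deleted
    return churn / len(previous) <= threshold

repos = {f"repo{i:04d}" for i in range(1500)}
print(should_accept(repos, repos - {"repo0000"}))  # True: 1/1500 churn
print(should_accept(repos, set()))                 # False: mass deletion
```

A mirror applying this check would have refused to replay the corrupted, near-empty project list and kept its safe copies.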
rate-limiting  case-studies  post-mortems  kde  git  data-corruption  risks  mirroring  replication  raid  bugs  backups  snapshots  sanity-checks  automation  ops 
march 2013 by jm