jm + compression: 37 bookmarks

How eBay’s Shopping Cart used compression techniques to solve network I/O bottlenecks
Compressing data written to MongoDB using LZ4_HIGH dropped oplog write rates from 150GB/hour to 11GB/hour; Snappy and gzip didn't fare nearly as well by comparison.
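A rough illustration of comparing those codecs' ratios on a JSON-ish payload -- nothing to do with eBay's actual oplog workload, and it assumes the third-party lz4 and python-snappy packages:

```python
import json
import zlib

import lz4.frame   # pip install lz4 (constant name assumed from the package docs)
import snappy      # pip install python-snappy

doc = json.dumps({"sku": "123", "qty": 2, "price": 19.99, "attrs": ["a"] * 50}).encode()
data = doc * 1000  # repeat to give the compressors something to chew on

candidates = {
    "gzip (zlib, level 6)": zlib.compress(data, 6),
    "lz4 (high compression)": lz4.frame.compress(
        data, compression_level=lz4.frame.COMPRESSIONLEVEL_MINHC),
    "snappy": snappy.compress(data),
}
for name, blob in candidates.items():
    print(f"{name:24s} {len(blob):8d} bytes  ({len(data) / len(blob):.1f}x)")
```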
lz4  compression  gzip  json  snappy  scaling  ebay  mongodb 
16 days ago by jm
Beringei: A high-performance time series storage engine | Engineering Blog | Facebook Code
Beringei is different from other in-memory systems, such as memcache, because it has been optimized for storing time series data used specifically for health and performance monitoring. We designed Beringei to have a very high write rate and a low read latency, while being as efficient as possible in using RAM to store the time series data. In the end, we created a system that can store all the performance and monitoring data generated at Facebook for the most recent 24 hours, allowing for extremely fast exploration and debugging of systems and services as we encounter issues in production.

Data compression was necessary to help reduce storage overhead. We considered several existing compression schemes and rejected the techniques that applied only to integer data, used approximation techniques, or needed to operate on the entire dataset. Beringei uses a lossless streaming compression algorithm to compress points within a time series with no additional compression used across time series. Each data point is a pair of 64-bit values representing the timestamp and value of the counter at that time. Timestamps and values are compressed separately using information about previous values. Timestamp compression uses a delta-of-delta encoding, so regular time series use very little memory to store timestamps.

From analyzing the data stored in our performance monitoring system, we discovered that the value in most time series does not change significantly when compared to its neighboring data points. Further, many data sources only store integers (despite the system supporting floating point values). Knowing this, we were able to tune previous academic work to be easier to compute by comparing the current value with the previous value using XOR, and storing the changed bits. Ultimately, this algorithm resulted in compressing the entire data set by at least 90 percent.
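A minimal sketch of the two tricks described above (delta-of-delta timestamps, XOR'd values), not Facebook's actual Beringei code:

```python
import struct

def delta_of_delta(timestamps):
    """Return the first timestamp plus the stream of second differences.
    For a regular series (e.g. one point every 60s) almost every entry is 0,
    which a bit-level encoder can store in a single bit."""
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta - prev_delta)
        prev, prev_delta = t, delta
    return out

def xor_values(values):
    """XOR each 64-bit float with its predecessor; near-identical neighbours
    leave long runs of zero bits that compress down to a handful of bits."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [bits[0]] + [b ^ p for p, b in zip(bits, bits[1:])]

ts = [1600000000 + 60 * i for i in range(5)]
vals = [101.0, 101.0, 101.5, 101.5, 102.0]
print(delta_of_delta(ts))               # [1600000000, 60, 0, 0, 0]
print([hex(x) for x in xor_values(vals)])
```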
beringei  compression  facebook  monitoring  tsd  time-series  storage  architecture 
19 days ago by jm
Evolving MySQL Compression - Part 2 | Pinterest Engineering
generating a near-optimal external dictionary for Zlib deflate compression
compression  deflate  zlib  pinterest  hacks  mysql 
25 days ago by jm
lbzip2
a free, multi-threaded compression utility with support for the bzip2 compressed file format. lbzip2 can process standard bz2 files in parallel. It uses the POSIX threading model (pthreads), which allows it to take full advantage of symmetric multiprocessing (SMP) systems. It has been proven to scale linearly, even to over one hundred processor cores.

lbzip2 is fully compatible with bzip2 – both at file format and command line level. Files created by lbzip2 can be decompressed by all versions of bzip2 and other software supporting bz2 format. lbzip2 can decompress any bz2 files in parallel. All bzip2 command-line options are also accepted by lbzip2. This makes lbzip2 a drop-in replacement for bzip2.
bzip2  gzip  compression  lbzip2  parallel  cli  tools 
march 2016 by jm
research!rsc: Zip Files All The Way Down
quine.zip, quine.gz, and quine.tar.gz. Here's what happens when you mail it through bad AV software: https://twitter.com/FioraAeterna/status/694655296707297281
zip  algorithms  compression  quines  fun  hacks  gzip 
february 2016 by jm
GZinga
'Seekable and Splittable Gzip', from eBay
ebay  gzip  compression  seeking  streams  splitting  logs  gzinga 
october 2015 by jm
The New InfluxDB Storage Engine: A Time Structured Merge Tree
The new engine has similarities with LSM Trees (like LevelDB and Cassandra’s underlying storage). It has a write ahead log, index files that are read only, and it occasionally performs compactions to combine index files. We’re calling it a Time Structured Merge Tree because the index files keep contiguous blocks of time and the compactions merge those blocks into larger blocks of time. Compression of the data improves as the index files are compacted. Once a shard becomes cold for writes it will be compacted into as few files as possible, which yield the best compression.
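A toy picture of the compaction step described above (an assumption-laden sketch, not InfluxDB's TSM code): merge several read-only, time-sorted index files into one larger contiguous block of time:

```python
import heapq

# Three read-only, time-sorted "index files" of (timestamp, value) -- toy data.
file_a = [(1000, 1.0), (1010, 1.1), (1020, 1.2)]
file_b = [(1005, 9.0), (1015, 9.1)]
file_c = [(1030, 5.0)]

# A compaction merges them into one larger, still time-ordered block,
# which then compresses better as a single contiguous run of time.
compacted = list(heapq.merge(file_a, file_b, file_c))
print(compacted[0], compacted[-1])  # spans 1000..1030 in one file
```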
influxdb  storage  lsm-trees  leveldb  tsm-trees  data-structures  algorithms  time-series  tsd  compression 
october 2015 by jm
Brotli: a new compression algorithm for the internet from Google
While Zopfli is Deflate-compatible, Brotli is a whole new data format. This new format allows us to get 20–26% higher compression ratios over Zopfli. In our study ‘Comparison of Brotli, Deflate, Zopfli, LZMA, LZHAM and Bzip2 Compression Algorithms’ we show that Brotli is roughly as fast as zlib’s Deflate implementation. At the same time, it compresses slightly more densely than LZMA and bzip2 on the Canterbury corpus. The higher data density is achieved by a 2nd order context modeling, re-use of entropy codes, larger memory window of past data and joint distribution codes. Just like Zopfli, the new algorithm is named after Swiss bakery products. Brötli means ‘small bread’ in Swiss German.
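For a hands-on feel of the density claim, a quick and unscientific size comparison against zlib's Deflate; the third-party brotli package (and its quality= keyword) is an assumption:

```python
import zlib

import brotli  # pip install brotli

data = b"the quick brown fox jumps over the lazy dog. " * 2000
deflated = zlib.compress(data, 9)            # Deflate at maximum compression
brotlied = brotli.compress(data, quality=11) # Brotli at maximum quality
print(f"original {len(data)}, deflate {len(deflated)}, brotli {len(brotlied)}")
```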
brotli  zopfli  deflate  gzip  compression  algorithms  swiss  google 
september 2015 by jm
The Netflix Test Video
Netflix's official test video -- contains various scenarios that exercise tricky edge cases in video compression and playback: A/V sync, shades of black, running water, etc.
networking  netflix  streaming  video  compression  tests 
august 2015 by jm
Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage, and Network-bandwidth | USENIX
Erasure codes, such as Reed-Solomon (RS) codes, are increasingly being deployed as an alternative to data-replication for fault tolerance in distributed storage systems. While RS codes provide significant savings in storage space, they can impose a huge burden on the I/O and network resources when reconstructing failed or otherwise unavailable data. A recent class of erasure codes, called minimum-storage-regeneration (MSR) codes, has emerged as a superior alternative to the popular RS codes, in that it minimizes network transfers during reconstruction while also being optimal with respect to storage and reliability. However, existing practical MSR codes do not address the increasingly important problem of I/O overhead incurred during reconstructions, and are, in general, inferior to RS codes in this regard. In this paper, we design erasure codes that are simultaneously optimal in terms of I/O, storage, and network bandwidth. Our design builds on top of a class of powerful practical codes, called the product-matrix-MSR codes. Evaluations show that our proposed design results in a significant reduction in the number of I/Os consumed during reconstructions (a 5x reduction for typical parameters), while retaining optimality with respect to storage, reliability, and network bandwidth.
erasure-coding  reed-solomon  compression  reliability  reconstruction  replication  fault-tolerance  storage  bandwidth  usenix  papers 
february 2015 by jm
Listen to a song made from data lost during MP3 conversion
Ryan McGuire, a PhD student in Composition and Computer Technologies at the University of Virginia Center for Computer Music, has created the project The Ghost In The MP3 [....] For his first trick, McGuire took Suzanne Vega’s ‘Tom’s Diner’ and drained it into a vaporous piece titled ‘moDernisT’. McGuire chose the track, he explains on his site, because it was famously used as one of the main controls in the listening tests used to develop the MP3 algorithm.
mp3  music  suzanne-vega  compression 
february 2015 by jm
Roaring Bitmaps
Bitsets, also called bitmaps, are commonly used as fast data structures. Unfortunately, they can use too much memory. To compensate, we often use compressed bitmaps. Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can be hundreds of times faster and they often offer significantly better compression.

Roaring bitmaps are used in Apache Lucene (as of version 5.0 using an independent implementation) and Apache Spark (as of version 1.2).
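The core partitioning idea, sketched in toy form (not the real Roaring implementation): split 32-bit values by their high 16 bits, then pick an array or bitmap container per chunk depending on cardinality:

```python
ARRAY_LIMIT = 4096  # above this cardinality an 8 KB bitmap container is smaller

def roaring_like(values):
    chunks = {}
    for v in sorted(set(values)):
        chunks.setdefault(v >> 16, []).append(v & 0xFFFF)  # key on the high 16 bits
    containers = {}
    for key, lows in chunks.items():
        if len(lows) <= ARRAY_LIMIT:
            containers[key] = ("array", lows)              # sparse: sorted 16-bit values
        else:
            bitmap = bytearray(8192)                       # dense: 2^16-bit bitmap
            for low in lows:
                bitmap[low >> 3] |= 1 << (low & 7)
            containers[key] = ("bitmap", bitmap)
    return containers

c = roaring_like(list(range(100)) + list(range(1 << 16, (1 << 16) + 70000)))
for key, (kind, payload) in sorted(c.items()):
    print(key, kind, len(payload))
```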
bitmaps  bitsets  sets  data-structures  bits  compression  lucene  spark  daniel-lemire  algorithms 
november 2014 by jm
Breaking Spotify DRM with PANDA
Reverse engineering a DRM implementation by instrumenting a VM and performing entropy/compressibility analysis on function call inputs and outputs. Impressive.
reversing  spotify  drm  panda  vm  compression  entropy  compressability  qemu  via:hn 
july 2014 by jm
"A New Data Structure For Cumulative Frequency Tables"
paper by Peter M Fenwick, 1993. 'A new method (the ‘binary indexed tree’) is presented for maintaining the cumulative frequencies which are needed to support dynamic arithmetic data compression. It is based on a decomposition of the cumulative frequencies into portions which parallel the binary representation of the index of the table element (or symbol). The operations to traverse the data structure are based on the binary coding of the index. In comparison with previous methods, the binary indexed tree is faster, using more compact data and simpler code. The access time for all operations is either constant or proportional to the logarithm of the table size. In conjunction with the compact data structure, this makes the new method particularly suitable for large symbol alphabets.'

via Jakob Buchgraber, who's implementing it right now in Netty ;)
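A minimal Fenwick ("binary indexed") tree for cumulative frequencies, written straight from the paper's description; indices are 1-based as in the paper:

```python
class FenwickTree:
    def __init__(self, size):
        self.tree = [0] * (size + 1)

    def add(self, i, delta):
        """Add delta to the frequency of symbol i."""
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i          # step to the next node that covers index i

    def cumulative(self, i):
        """Return the sum of frequencies of symbols 1..i."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i          # strip the lowest set bit
        return total

ft = FenwickTree(8)
for symbol, count in [(3, 5), (5, 2), (7, 1)]:
    ft.add(symbol, count)
print(ft.cumulative(5))  # 7: total frequency of symbols 1..5
```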
netty  frequency-tables  data-structures  algorithms  coding  binary-tree  indexing  compression  symbol-alphabets 
may 2014 by jm
FM-index
a compressed full-text substring index based on the Burrows-Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,[1] who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for 'Full-text index in Minute space'. It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence. Both the query time and storage space requirements are sublinear with respect to the size of the input data.


kragen notes 'gene sequencing is using [them] in production'.
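A very small, assumption-heavy sketch of the FM-index's backward-search counting query, with plain Python scans standing in for the succinct rank structures a real implementation would use:

```python
def bwt(text):
    text += "\0"  # unique sentinel, lexicographically smallest
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(text, pattern):
    last = bwt(text)
    # C[c] = number of characters in the text strictly smaller than c
    C = {c: sum(1 for x in last if x < c) for c in set(last)}
    # rank(c, i) = occurrences of c in last[0:i]; a real FM-index answers this in O(1)
    rank = lambda c, i: sum(1 for x in last[:i] if x == c)
    lo, hi = 0, len(last)
    for c in reversed(pattern):          # backward search, one character at a time
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(count_occurrences("abracadabra", "abra"))  # 2
print(count_occurrences("abracadabra", "cad"))   # 1
```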
sequencing  bioinformatics  algorithms  bowtie  fm-index  indexing  compression  search  burrows-wheeler  bwt  full-text-search 
march 2014 by jm
Sux
Some basic succinct data structures. [...] The main highlights are:
a novel, broadword-based implementation of rank/select queries for up to 2^64 bits that is highly competitive with known 32-bit implementations on 64-bit architectures (additional space required is 25% for ranking and 12.5%-37.5% for selection);
several Java structures using the Elias–Fano representation of monotone sequences for storing pointers, variable-length bit arrays, etc.
Java code implementing minimal perfect hashing using around 2.68 bits per element (also using some broadword ideas);
a few Java implementations of monotone minimal perfect hashing.
Sux is free software distributed under the GNU Lesser General Public License.
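For orientation, the two queries Sux accelerates -- rank and select -- in naive toy form over a Python integer used as a bit vector; nothing like Sux's broadword implementation:

```python
def rank1(bits, i):
    """Number of 1 bits among positions 0..i-1."""
    return bin(bits & ((1 << i) - 1)).count("1")

def select1(bits, k):
    """Position of the k-th 1 bit (1-based), or -1 if there are fewer."""
    pos = 0
    while bits:
        if bits & 1:
            k -= 1
            if k == 0:
                return pos
        bits >>= 1
        pos += 1
    return -1

b = 0b10110100           # bits set at positions 2, 4, 5, 7
print(rank1(b, 5))       # 2: ones among the 5 lowest positions
print(select1(b, 3))     # 5: position of the third 1 bit
```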
sux  succinct  data-structures  bits  compression  space  coding 
january 2014 by jm
Harry - A Tool for Measuring String Similarity
a small tool for comparing strings and measuring their similarity. The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus of Harry lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein distance and the Jaro-Winkler distance.
For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output. The similarity measure can be computed based on the granularity of characters as well as words contained in the strings. The configuration of this process, such as the input format, the similarity measure and the output format, are specified in a configuration file and can be additionally refined using command-line options.
Harry is implemented using OpenMP, such that the computation time for a set of strings scales linearly with the number of available CPU cores. Moreover, efficient implementations of several similarity measures, effective caching of similarity values and low-overhead locking further speed up the computation.


via kragen.
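One of the measures Harry supports -- textbook dynamic-programming Levenshtein distance -- as a small reference sketch (Harry itself is a C/OpenMP tool):

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (or match)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```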
via:kragen  strings  similarity  levenshtein-distance  algorithms  openmp  jaro-winkler  edit-distance  cli  commandline  hamming-distance  compression 
january 2014 by jm
Google Fonts recently switched to using Zopfli
Google Fonts recently switched to using the new Zopfli compression algorithm: the fonts are ~6% smaller on average, and in some cases up to 15% smaller! [...]
What's Zopfli? It's an algorithm developed by the compression team at Google that delivers a ~3-8% byte-size improvement compared to gzip with maximum compression. This byte saving comes at the cost of much higher encoding time, but the good news is that fonts are static files and decompression speed is exactly the same. Google Fonts pays the compression cost once and every client gets the benefit of a smaller download. If you’re curious to learn more about Zopfli: http://bit.ly/Y8DEL4
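The pay-once/decode-anywhere point above, sketched with the stdlib gzip module alongside the pyzopfli bindings; the zopfli.gzip.compress call and the font filename are assumptions:

```python
import gzip

import zopfli.gzip  # pip install zopfli (pyzopfli; module path assumed)

data = open("SomeFont.ttf", "rb").read()   # hypothetical static asset
blob = zopfli.gzip.compress(data)          # slow, maximum-effort encode, done once
assert gzip.decompress(blob) == data       # decodes with the ordinary stdlib gzip
print(len(data), "->", len(blob))
```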
zopfli  compression  gzip  fonts  google  speed  optimization 
january 2014 by jm
Finite State Entropy coding
As can be guessed, the higher the compression ratio, the more efficient FSE becomes compared to Huffman, since Huffman can't break the "1 bit per symbol" limit. FSE speed is also very stable, under all probabilities. I'm quite pleased with the result, especially considering that, since the invention of arithmetic coding in the 70's, nothing really new has been brought to this field.

This is still beta stuff, so please consider this first release for testing purposes mostly.


Looking forward to this making it into a production release of some form.
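A worked illustration of why the "1 bit per symbol" floor matters for skewed sources, which is exactly where FSE and arithmetic coding win:

```python
from math import log2

p = 0.95  # probability of the dominant symbol in a skewed binary source
entropy = -(p * log2(p) + (1 - p) * log2(1 - p))
print(f"entropy      : {entropy:.3f} bits/symbol")  # ~0.286
print("huffman floor: 1.000 bits/symbol")           # one whole bit per symbol, minimum
print(f"wasted       : {1 - entropy:.3f} bits/symbol")
```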
compression  algorithms  via:kragen  fse  finite-state-entropy-coding  huffman  arithmetic-coding 
january 2014 by jm
Ivan Ristić: Defending against the BREACH attack
One interesting response to this HTTPS compression-based MITM attack:
The award for least-intrusive and entirely painless mitigation proposal goes to Paul Querna who, on the httpd-dev mailing list, proposed to use the HTTP chunked encoding to randomize response length. Chunked encoding is an HTTP feature that is typically used when the size of the response body is not known in advance; only the size of the next chunk is known. Because chunks carry some additional information, they affect the size of the response, but not the content. By forcing more chunks than necessary, for example, you can increase the length of the response. To the attacker, who can see only the size of the response body, but not anything else, the chunks are invisible. (Assuming they're not sent in individual TCP packets or TLS records, of course.) This mitigation technique is very easy to implement at the web server level, which makes it the least expensive option. There is only a question about its effectiveness. No one has done the maths yet, but most seem to agree that response length randomization slows down the attacker, but does not prevent the attack entirely. But, if the attack can be slowed down significantly, perhaps it will be as good as prevented.
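A sketch of the randomized-chunking idea from the quote (an assumption about the approach, not Apache's implementation):

```python
import os
import random

def random_chunks(body, min_size=256, max_size=4096):
    """Yield HTTP/1.1 chunked-encoding pieces with randomized chunk sizes."""
    i = 0
    while i < len(body):
        n = random.randint(min_size, max_size)
        chunk = body[i:i + n]
        yield b"%x\r\n" % len(chunk) + chunk + b"\r\n"
        i += n
    yield b"0\r\n\r\n"  # terminating chunk

body = os.urandom(10_000)             # stand-in for a compressed response body
wire = b"".join(random_chunks(body))
print(len(body), "->", len(wire))     # total wire length varies run to run
```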
mitm  attacks  hacking  security  compression  http  https  protocols  tls  ssl  tcp  chunked-encoding  apache 
august 2013 by jm
Xerox scanners/photocopiers randomly alter numbers in scanned documents · D. Kriesel
Pretty major Xerox fail: photocopied/scanned docs are found to have replaced the digit '6' with '8', due to a poor choice of compression techniques:
Several mails I got suggest that the xerox machines use JBIG2 for compression. This algorithm creates a dictionary of image patches it finds “similar”. Those patches then get reused instead of the original image data, as long as the error generated by them is not “too high”. Makes sense. This also would explain, why the error occurs when scanning letters or numbers in low resolution (still readable, though). In this case, the letter size is close to the patch size of JBIG2, and whole “similar” letters or even letter blocks get replaced by each other.
jbig2  compression  xerox  photocopying  scanning  documents  fonts  arial  image-compression  images 
august 2013 by jm
Compression in Kafka: GZIP or Snappy ?
With Ack: in this mode, as far as compression is concerned, the data gets compressed at the producer, decompressed and compressed on the broker before it sends the ack to the producer. The producer throughput with Snappy compression was roughly 22.3MB/s as compared to 8.9MB/s of the GZIP producer. Producer throughput is 150% higher with Snappy as compared to GZIP.

No ack, similar to Kafka 0.7 behavior: In this mode, the data gets compressed at the producer and it doesn’t wait for the ack from the broker. The producer throughput with Snappy compression was roughly 60.8MB/s as compared to 18.5MB/s of the GZIP producer. Producer throughput is 228% higher with Snappy as compared to GZIP. The higher compression savings in this test are due to the fact that the producer does not wait for the leader to re-compress and append the data; it simply compresses messages and fires away. Since Snappy has very high compression speed and low CPU usage, a single producer is able to compress the same amount of messages much faster as compared to GZIP.
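A rough producer-side comparison in the same spirit, measuring compression speed only (no broker, no acks); the third-party python-snappy package is assumed, so treat the numbers as illustrative:

```python
import gzip
import time

import snappy  # pip install python-snappy

payload = b'{"user": 42, "event": "click", "page": "/home"}\n' * 20000

def mb_per_sec(compress):
    start = time.perf_counter()
    compress(payload)
    return len(payload) / (time.perf_counter() - start) / 1e6

print(f"gzip   {mb_per_sec(lambda d: gzip.compress(d, 6)):8.1f} MB/s")
print(f"snappy {mb_per_sec(snappy.compress):8.1f} MB/s")
```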
gzip  snappy  compression  kafka  streaming  ops 
april 2013 by jm
javaewah
The bit array data structure is implemented in Java as the BitSet class. Unfortunately, this fails to scale without compression. JavaEWAH is a word-aligned compressed variant of the Java bitset class. It uses a 64-bit run-length encoding (RLE) compression scheme. We trade off some compression for better processing speed. We also have a 32-bit version which compresses better, but is not as fast.

In general, the goal of word-aligned compression is not to achieve the best compression, but rather to improve query processing time. Hence, we try to save CPU cycles, maybe at the expense of storage. However, the EWAH scheme we implemented is always more efficient storage-wise than an uncompressed bitmap (as implemented in the BitSet class). Unlike some alternatives, javaewah does not rely on a patented scheme.
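A toy word-aligned RLE in the spirit of EWAH (not the real on-disk format): runs of all-zero or all-one 64-bit words are counted, everything else is stored as literal words:

```python
ALL_ONES = (1 << 64) - 1

def ewah_like(words):
    out, i = [], 0
    while i < len(words):
        if words[i] in (0, ALL_ONES):            # "clean" word: extend a run
            fill, run = words[i], 0
            while i < len(words) and words[i] == fill:
                run += 1
                i += 1
            out.append(("run", fill != 0, run))
        else:                                    # "dirty" word: store verbatim
            out.append(("literal", words[i]))
            i += 1
    return out

print(ewah_like([0, 0, 0, 0b1010, ALL_ONES, ALL_ONES, 7]))
# [('run', False, 3), ('literal', 10), ('run', True, 2), ('literal', 7)]
```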
javaewah  wah  rle  compression  bitmaps  bitmap-indexes  bitset  algorithms  data-structures 
april 2013 by jm
FastBit: An Efficient Compressed Bitmap Index Technology
an [LGPL] open-source data processing library following the spirit of NoSQL movement. It offers a set of searching functions supported by compressed bitmap indexes. It treats user data in the column-oriented manner similar to well-known database management systems such as Sybase IQ, MonetDB, and Vertica. It is designed to accelerate user's data selection tasks without imposing undue requirements. In particular, the user data is NOT required to be under the control of FastBit software, which allows the user to continue to use their existing data analysis tools.

The key technology underlying the FastBit software is a set of compressed bitmap indexes. In database systems, an index is a data structure to accelerate data accesses and reduce the query response time. Most of the commonly used indexes are variants of the B-tree, such as B+-tree and B*-tree. FastBit implements a set of alternative indexes called compressed bitmap indexes. Compared with B-tree variants, these indexes provide very efficient searching and retrieval operations, but are somewhat slower to update after a modification of an individual record.

A key innovation in FastBit is the Word-Aligned Hybrid compression (WAH) for the bitmaps.[...] Another innovation in FastBit is the multi-level bitmap encoding methods.
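The basic bitmap-index trick referred to above, in toy form: one bitmap per distinct column value, with queries answered by bitwise AND/OR. FastBit layers WAH compression and multi-level encodings on top of this idea:

```python
rows_city   = ["dub", "sfo", "dub", "nyc", "dub", "sfo"]
rows_status = ["ok",  "err", "ok",  "ok",  "err", "ok"]

def build_index(column):
    """One bitmap (here: a Python int) per distinct value of the column."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)  # set bit `row`
    return index

city, status = build_index(rows_city), build_index(rows_status)
hits = city["dub"] & status["ok"]     # WHERE city = 'dub' AND status = 'ok'
print([row for row in range(len(rows_city)) if hits >> row & 1])  # [0, 2]
```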
fastbit  nosql  algorithms  indexing  search  compressed-bitmaps  indexes  wah  bitmaps  compression 
april 2013 by jm
Cap'n Proto
Cap’n Proto is an insanely fast data interchange format and capability-based RPC system. Think JSON, except binary. Or think Protocol Buffers, except faster. In fact, in benchmarks, Cap’n Proto is INFINITY TIMES faster than Protocol Buffers.


Basically, marshalling is like writing an aligned C struct to the wire, QNX messaging protocol-style. Wasteful on space, but it responds to this by suggesting compression (which is a fair point tbh). C++-only for now. I'm not seeing the same kind of support for optional data that protobufs has, though. Overall I'm worried there are some useful features being omitted here...
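What "an aligned C struct to the wire" looks like in miniature, using Python's struct module as a stand-in; a sketch of the flavour, not Cap'n Proto's actual layout:

```python
import struct

# uint32 id, pad, float64 price, uint32 qty, pad -- mimicking a C struct's
# natural alignment, so fields sit at fixed offsets and need no parse step.
LAYOUT = struct.Struct("<I4xdI4x")

wire = LAYOUT.pack(42, 19.99, 3)
print(len(wire), LAYOUT.unpack(wire))   # 24 (42, 19.99, 3)
```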
serialization  formats  protobufs  capn-proto  protocols  coding  c++  rpc  qnx  messaging  compression  compatibility  interoperability  i14y 
april 2013 by jm
Parquet
'a columnar storage format that supports nested data', from Twitter and Cloudera, encoded using Apache Thrift in a Dremel-based record shredding and assembly algorithm. Pretty crazy stuff:
We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.
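Per-column compression as described, sketched with pyarrow; the compression-dict argument and the file name are assumptions to check against your pyarrow version:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": list(range(1000)),
    "payload": ["x" * 100] * 1000,
})
# Different codec per column; "events.parquet" is just an example path.
pq.write_table(table, "events.parquet",
               compression={"user_id": "snappy", "payload": "gzip"})
print(pq.read_metadata("events.parquet"))
```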
twitter  cloudera  storage  parquet  dremel  columns  record-shredding  hadoop  marshalling  columnar-storage  compression  data 
march 2013 by jm
Compress data more densely with Zopfli - Google Developers Blog
New compressor from Google; gzip/zip-compatible, slower, but produces slightly smaller results
compression  gzip  zip  deflate  google 
march 2013 by jm
lrzip
'Lrzip uses an extended version of rzip which does a first pass long distance redundancy reduction. The lrzip modifications make it scale according to memory size. [...] The unique feature of lrzip is that it tries to make the most of the available ram in your system at all times for maximum benefit. It does this by default, choosing the largest sized window possible without running out of memory.'
zip  compression  via:dakami  gzip  bzip2  archiving  benchmarks 
february 2012 by jm
snappy - A fast compressor/decompressor
'On a single core of a Core i7 processor in 64-bit mode, it compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. (These numbers are for the slowest inputs in our benchmark suite; others are much faster.) In our tests, Snappy usually is faster than algorithms in the same class (e.g. LZO, LZF, FastLZ, QuickLZ, etc.) while achieving comparable compression ratios.'  Apache-licensed, from Google
snappy  google  compression  speed  from delicious
march 2011 by jm
Blosc
A high-performance compressor optimized for binary data -- 'designed to transmit data to the processor cache faster than a traditional, non-compressed, direct memory fetch via memcpy()' (via Bill de hOra)
via:dehora  compression  memcpy  caching  l1  software  memory  optimization  performance  python  pytables  from delicious
october 2010 by jm
Why Our Civilization's Video Art and Culture is Threatened by the MPEG-LA
incredible. Almost every single modern camera capable of recording video now requires that you obtain a license from MPEG-LA to use recorded footage for commercial purposes. These clauses are currently not enforced, but could be. Horrifying (via Tony Finch)
via:fanf  patents  mpeg2  codec  compression  consumer-rights  copyright  legal  law  mpeg  h264  mpegla  codecs  from delicious
may 2010 by jm
VInt
'A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on.' UTF8-ish compression, used in Avro
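The encoding, sketched directly from the quoted definition:

```python
def encode_vint(n):
    """Encode a non-negative integer, 7 bits per byte, low-order bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # high bit clear: last byte
            return bytes(out)

def decode_vint(data):
    n = shift = 0
    for byte in data:
        n |= (byte & 0x7F) << shift   # append as increasingly significant bits
        if not byte & 0x80:
            return n
        shift += 7

for value in (0, 127, 128, 16383, 16384):
    encoded = encode_vint(value)
    print(value, "->", len(encoded), "byte(s)", decode_vint(encoded) == value)
```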
utf8  compression  utf  lucene  avro  hadoop  java  fomats  numeric  from delicious
november 2009 by jm
XZ Utils
15% smaller than bzip, 30% smaller than gzip, and now shipped with Fedora and Ubuntu. uses LZMA2
xz  xzdec  gzip  bzip  compression  lzma  via:wmf  unix  compress  from delicious
october 2009 by jm
pigz
'A parallel implementation of gzip for modern multi-processor, multi-core machines', by Mark Adler, no less
adler  pigz  gzip  compression  performance  concurrency  shell  parallel  multicore  zip  software  from delicious
october 2009 by jm
