lemire/JavaFastPFOR: A simple integer compression library in Java
compression
java
pfor
encoding
integers
algorithms
storage
april 2018 by jm
a library to compress and uncompress arrays of integers very fast. The assumption is that most (but not all) values in your array use far fewer than 32 bits, or that the gaps between the integers use far fewer than 32 bits. Arrays of this sort often come up when using differential coding in databases and information retrieval (e.g., in inverted indexes or column stores).
Please note that random integers are not compressible, by this library or by any other means. If you ever had the means of systematically compressing random integers, you could compress any data source to nothing, by recursive application of your technique.
This library can decompress integers at a rate of over 1.2 billion per second (4.5 GB/s). It is significantly faster than generic codecs (such as Snappy, LZ4 and so on) when compressing arrays of integers.
The library is used in LinkedIn Pinot, a realtime distributed OLAP datastore. Part of this library has been integrated into Parquet (http://parquet.io/). A modified version of the library is included in the search engine Terrier (http://terrier.org/). This library is used by ClueWeb Tools (https://github.com/lintool/clueweb). It is also used by Apache NiFi.
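The differential coding mentioned above can be illustrated with a minimal Python sketch. It is not JavaFastPFOR's API or its PFOR scheme; the function names and the sample document IDs are hypothetical, and the only point is to show why the gaps in a sorted array need far fewer bits than the values themselves.

```python
def delta_encode(values):
    """Replace each value with its gap from the previous one (the first gap is taken from 0)."""
    prev = 0
    gaps = []
    for v in values:
        gaps.append(v - prev)
        prev = v
    return gaps


def delta_decode(gaps):
    """Rebuild the original values with a running sum over the gaps."""
    total = 0
    out = []
    for g in gaps:
        total += g
        out.append(total)
    return out


def max_bits(values):
    """Number of bits needed for the largest value in the list."""
    return max(v.bit_length() for v in values)


# Hypothetical sorted document IDs, as might appear in an inverted index.
doc_ids = [1_000_000, 1_000_003, 1_000_010, 1_000_012, 1_000_040]
gaps = delta_encode(doc_ids)

print(gaps)                # [1000000, 3, 7, 2, 28]
print(max_bits(doc_ids))   # 20 bits per raw value
print(max_bits(gaps[1:]))  # 5 bits per gap after the first
```

A bit-packing codec would then store each gap in a handful of bits rather than a full 32-bit word, which is the kind of saving the description refers to.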
Falsehoods Programmers Believe About CSVs
via:fanf
csv
excel
programming
coding
apis
data
encoding
transfer
falsehoods
fail
rfc4180
january 2017 by jm
Much of my professional work for the last 10+ years has revolved around handling, importing and exporting CSV files. CSV files are frustratingly misunderstood, abused, and most of all underspecified. While RFC4180 exists, it is far from definitive and goes largely ignored.
Partially as a companion piece to my recent post about how CSV is an encoding nightmare, and partially an expression of frustration, I've decided to make a list of falsehoods programmers believe about CSVs. I recommend my previous post for more in-depth coverage of the pains of CSV encodings and how the default tooling (Excel) will ruin your day.
(via Tony Finch)
Building the plane on the way up
reed-solomon
encoding
error-correction
voyager
vger
history
space
nasa
probes
signalling
january 2017 by jm
in 1977, Jet Propulsion Lab (JPL) scientists packed a Reed-Solomon encoder in each Voyager, hardware designed to add error-correcting bits to all data beamed back at a rate of efficiency 80 percent higher than an older method also included with Voyager. Where did the hope come in? When the Voyager probes were launched with Reed-Solomon encoders on board, no Reed-Solomon decoders existed on Earth.
BTrDB: Optimizing Storage System Design for Timeseries Processing
may 2016 by jm
interesting, although they punt to Ceph for storage and miss out on the chance to make a CRDT
storage
trees
data-structures
timeseries
delta-delta-coding
encoding
deltas
Elias gamma coding
data-structures
algorithms
elias-gamma-coding
encoding
coding
numbers
integers
april 2016 by jm
'used most commonly when coding integers whose upper-bound cannot be determined beforehand.'
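As a rough illustration of the scheme, here is a small Python sketch of Elias gamma coding over bit strings. The standard construction writes a positive integer n as floor(log2 n) zero bits followed by the binary form of n; the function names and the use of "0"/"1" strings rather than packed bits are choices made for readability only.

```python
def elias_gamma_encode(n):
    """Encode a positive integer as (bit length - 1) zeros followed by its binary digits."""
    if n < 1:
        raise ValueError("Elias gamma codes positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a concatenation of Elias gamma codes back into a list of integers."""
    out = []
    i = 0
    while i < len(bits):
        # Count the leading zeros; they say how many bits follow the first 1.
        zeros = 0
        while i < len(bits) and bits[i] == "0":
            zeros += 1
            i += 1
        if i + zeros + 1 > len(bits):
            raise ValueError("truncated Elias gamma stream")
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out


print(elias_gamma_encode(1))   # 1
print(elias_gamma_encode(10))  # 0001010

stream = "".join(elias_gamma_encode(n) for n in [1, 2, 10, 255])
print(elias_gamma_decode(stream))  # [1, 2, 10, 255]
```

Because the run of zeros announces the length of each code, the decoder needs no upper bound on the values in advance, which matches the use case quoted above.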
Dark corners of Unicode
unicode
characters
encoding
emoji
utf-8
utf-16
utf
mysql
text
september 2015 by jm
I’m assuming, if you are on the Internet and reading kind of a nerdy blog, that you know what Unicode is. At the very least, you have a very general understanding of it — maybe “it’s what gives us emoji”.
That’s about as far as most people’s understanding extends, in my experience, even among programmers. And that’s a tragedy, because Unicode has a lot of… ah, depth to it. Not to say that Unicode is a terrible disaster — more that human language is a terrible disaster, and anything with the lofty goals of representing all of it is going to have some wrinkles.
So here is a collection of curiosities I’ve encountered in dealing with Unicode that you generally only find out about through experience. Enjoy.
tebeka / fastavro / issues / #11 - fastavro breaks dumping binary fixed [4] — Bitbucket
march 2015 by jm
The Python "fastavro" library cannot correctly render "bytes" fields. This is a bug, and the maintainer is acting in a really crappy manner in this thread. Avoid this library.
fastavro
fail
bugs
utf-8
bytes
encoding
asshats
open-source
python
A dive into a UTF-8 validation regexp
utf-8
unicode
utf8
javascript
node
encoding
text
strings
validation
websockets
regular-expressions
regexps
june 2014 by jm
Once again, I find myself checking over the UTF-8 validation code in websocket-driver, and once again I find I cannot ever remember how to make sense of this regex that performs the validation. I just copied it off a webpage once and it took a while (and reimplementing UTF-8 myself) to fully understand what it does. If you write software that processes text, you’ll probably need to understand this too.
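For readers who want to see the shape of such a regex, the following Python sketch validates bytes against the well-formed UTF-8 byte ranges given in RFC 3629 and the Unicode Standard. It is not the regex from websocket-driver or from the linked post (which concerns JavaScript); the names `UTF8_PATTERN` and `is_valid_utf8` are made up for this example.

```python
import re

# One alternative per row of the well-formed UTF-8 byte sequence table.
UTF8_PATTERN = re.compile(
    rb"""(?:
        [\x00-\x7F]
      | [\xC2-\xDF][\x80-\xBF]
      | \xE0[\xA0-\xBF][\x80-\xBF]
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
      | \xED[\x80-\x9F][\x80-\xBF]
      | \xF0[\x90-\xBF][\x80-\xBF]{2}
      | [\xF1-\xF3][\x80-\xBF]{3}
      | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*""",
    re.VERBOSE,
)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string is well-formed UTF-8."""
    return UTF8_PATTERN.fullmatch(data) is not None


print(is_valid_utf8("héllo €😀".encode("utf-8")))  # True
print(is_valid_utf8(b"\xC0\x80"))          # False: overlong encoding of U+0000
print(is_valid_utf8(b"\xED\xA0\x80"))      # False: UTF-16 surrogate U+D800
print(is_valid_utf8(b"\xF4\x90\x80\x80"))  # False: beyond U+10FFFF
```

The restricted second-byte ranges after E0, ED, F0 and F4 are what reject overlong forms, surrogates and code points above U+10FFFF; a naive "lead byte plus continuation bytes" pattern would accept all three.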
Simple Binary Encoding
sbe
encoding
protobuf
protocol-buffers
json
messages
messaging
binary
formats
low-latency
martin-thompson
xml
may 2014 by jm
an OSI layer 6 presentation for encoding/decoding messages in binary format to support low-latency applications. [...] SBE follows a number of design principles to achieve this goal. Adhering to these design principles sometimes means features available in other codecs will not be offered. For example, many codecs allow strings to be encoded at any field position in a message; SBE only allows variable length fields, such as strings, as fields grouped at the end of a message.
The SBE reference implementation consists of a compiler that takes a message schema as input and then generates language specific stubs. The stubs are used to directly encode and decode messages from buffers. The SBE tool can also generate a binary representation of the schema that can be used for the on-the-fly decoding of messages in a dynamic environment, such as for a log viewer or network sniffer.
The design principles drive the implementation of a codec that ensures messages are streamed through memory without backtracking, copying, or unnecessary allocation. Memory access patterns should not be underestimated in the design of a high-performance application. Low-latency systems in any language especially need to consider all allocation to avoid the resulting issues in reclamation. This applies for both managed runtime and native languages. SBE is totally allocation free in all three language implementations.
The end result of applying these design principles is a codec that has ~25X greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.
The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.
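The "fixed fields first, variable-length fields last" principle can be sketched in Python with the standard `struct` module. This is not the SBE wire format and not its generated stubs; the message layout, field names and little-endian choice are all assumptions made for illustration.

```python
import struct

# Hypothetical message: three fixed-size fields, then one variable-length field at the end.
# "<QIi" = little-endian uint64 timestamp, uint32 instrument id, int32 price (16 bytes, no padding).
FIXED = struct.Struct("<QIi")
PRICE_OFFSET = 12  # 8 bytes of timestamp + 4 bytes of instrument id


def encode(timestamp, instrument_id, price, symbol):
    """Pack the fixed block, then a 2-byte length prefix and the UTF-8 symbol."""
    raw = symbol.encode("utf-8")
    return FIXED.pack(timestamp, instrument_id, price) + struct.pack("<H", len(raw)) + raw


def read_price(buf):
    """Read one fixed field straight from its known offset, without touching the rest."""
    return struct.unpack_from("<i", buf, PRICE_OFFSET)[0]


def decode(buf):
    """Decode front to back in a single pass: fixed block first, trailing string last."""
    timestamp, instrument_id, price = FIXED.unpack_from(buf, 0)
    (length,) = struct.unpack_from("<H", buf, FIXED.size)
    start = FIXED.size + 2
    symbol = buf[start:start + length].decode("utf-8")
    return timestamp, instrument_id, price, symbol


msg = encode(1_700_000_000, 42, -1250, "VOD.L")
print(len(msg))         # 23
print(read_price(msg))  # -1250
print(decode(msg))      # (1700000000, 42, -1250, 'VOD.L')
```

Because every fixed field sits at a constant offset, a reader can pull out a single value without parsing anything before it, and the variable-length data never shifts those offsets since it only appears at the end.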
Simple Binary Encoding
december 2013 by jm
'SBE is an OSI layer 6 representation for encoding and decoding application messages in binary format for low-latency applications.'
Licensed under ASL2, C++ and Java supported.
sbe
encoding
codecs
persistence
binary
low-latency
open-source
java
c++
serialization
Transloadit
july 2010 by jm
AWS-based service to resize images, encode video files, extract thumbnails, and store to S3, for use by third-party web apps. Transcoding-as-a-service
encoding
images
s3
media
storage
transcoding
video
converter
fileupload
from delicious
SimpleRip: Ripping/Encoding DVDs to Xvid with Mencoder
june 2010 by jm
good idea -- generate a mencoder command-line using a friendlier Javascript single-page UI (via OMGUbuntu)
via:omgubuntu
avi
mplayer
conversion
divx
encoding
howto
rip
xvid
video
mencoder
from delicious