jm + encoding   12

lemire/JavaFastPFOR: A simple integer compression library in Java

a library to compress and uncompress arrays of integers very fast. The assumption is that most (but not all) values in your array use much less than 32 bits, or that the gaps between the integers use much less than 32 bits. These sorts of arrays often come up when using differential coding in databases and information retrieval (e.g., in inverted indexes or column stores).
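
To make the "gaps use much less than 32 bits" point concrete, here is a minimal Python sketch of differential coding. It is not the JavaFastPFOR API (the library is Java and considerably more sophisticated); the function names and the sample `doc_ids` list are made up for illustration, and the input is assumed to be sorted and non-negative.

```python
def delta_encode(sorted_values):
    """Replace each value with its gap from the previous one (first gap is from 0)."""
    gaps = []
    previous = 0
    for value in sorted_values:
        gaps.append(value - previous)
        previous = value
    return gaps


def delta_decode(gaps):
    """Invert delta_encode with a running sum."""
    values = []
    total = 0
    for gap in gaps:
        total += gap
        values.append(total)
    return values


def bits_needed(values):
    """Bit width sufficient to store every (non-negative) value in the block."""
    return max(1, max(values).bit_length())


# Hypothetical sorted posting list, as might appear in an inverted index.
doc_ids = [1000000, 1000003, 1000010, 1000011, 1000042]
gaps = delta_encode(doc_ids)

print(gaps)
print("bits per raw value:", bits_needed(doc_ids))
print("bits per gap after the first:", bits_needed(gaps[1:]))
```

A bit-packing codec can then store each gap in a handful of bits rather than a full 32-bit word, which is where the space saving comes from.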

Please note that random integers are not compressible, by this library or by any other means. If you ever had the means of systematically compressing random integers, you could compress any data source to nothing, by recursive application of your technique.

This library can decompress integers at a rate of over 1.2 billion per second (4.5 GB/s). It is significantly faster than generic codecs (such as Snappy, LZ4 and so on) when compressing arrays of integers.

The library is used in LinkedIn Pinot, a realtime distributed OLAP datastore. Part of this library has been integrated in Parquet (http://parquet.io/). A modified version of the library is included in the search engine Terrier (http://terrier.org/). This library is used by ClueWeb Tools (https://github.com/lintool/clueweb). It is also used by Apache NiFi.
compression  java  pfor  encoding  integers  algorithms  storage 
11 days ago by jm
Falsehoods Programmers Believe About CSVs
Much of my professional work for the last 10+ years has revolved around handling, importing and exporting CSV files. CSV files are frustratingly misunderstood, abused, and most of all underspecified. While RFC4180 exists, it is far from definitive and goes largely ignored.

Partially as a companion piece to my recent post about how CSV is an encoding nightmare, and partially an expression of frustration, I've decided to make a list of falsehoods programmers believe about CSVs. I recommend my previous post for more in-depth coverage of the pains of CSV encodings and how the default tooling (Excel) will ruin your day.


(via Tony Finch)
via:fanf  csv  excel  programming  coding  apis  data  encoding  transfer  falsehoods  fail  rfc4180 
january 2017 by jm
Building the plane on the way up
in 1977, Jet Propulsion Lab (JPL) scientists packed a Reed-Solomon encoder in each Voyager, hardware designed to add error-correcting bits to all data beamed back at a rate of efficiency 80 percent higher than an older method also included with Voyager. Where did the hope come in? When the Voyager probes were launched with Reed-Solomon encoders on board, no Reed-Solomon decoders existed on Earth.
reed-solomon  encoding  error-correction  voyager  vger  history  space  nasa  probes  signalling 
january 2017 by jm
BTrDB: Optimizing Storage System Design for Timeseries Processing
interesting, although they punt to Ceph for storage and miss out on the chance to make a CRDT
storage  trees  data-structures  timeseries  delta-delta-coding  encoding  deltas 
may 2016 by jm
Elias gamma coding
'used most commonly when coding integers whose upper-bound cannot be determined beforehand.'
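
A short Python sketch of the scheme, using strings of "0" and "1" instead of a packed bit stream for readability: a positive integer n is written as floor(log2 n) zero bits followed by n in binary. The function names are illustrative, and the decoder assumes well-formed input.

```python
def elias_gamma_encode(n):
    """Encode a positive integer as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("Elias gamma only codes positive integers")
    binary = bin(n)[2:]  # binary digits, always starting with '1'
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a concatenation of Elias gamma codes back into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        # the next zeros + 1 bits are the value itself
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


stream = "".join(elias_gamma_encode(n) for n in [1, 5, 10])
print(stream)
print(elias_gamma_decode(stream))
```

Because the zero prefix announces the length of each code, no upper bound on the values has to be agreed in advance, which is the property the quote above refers to.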
data-structures  algorithms  elias-gamma-coding  encoding  coding  numbers  integers 
april 2016 by jm
Dark corners of Unicode
I’m assuming, if you are on the Internet and reading kind of a nerdy blog, that you know what Unicode is. At the very least, you have a very general understanding of it — maybe “it’s what gives us emoji”.

That’s about as far as most people’s understanding extends, in my experience, even among programmers. And that’s a tragedy, because Unicode has a lot of… ah, depth to it. Not to say that Unicode is a terrible disaster — more that human language is a terrible disaster, and anything with the lofty goals of representing all of it is going to have some wrinkles.

So here is a collection of curiosities I’ve encountered in dealing with Unicode that you generally only find out about through experience. Enjoy.
unicode  characters  encoding  emoji  utf-8  utf-16  utf  mysql  text 
september 2015 by jm
tebeka / fastavro / issues / #11 - fastavro breaks dumping binary fixed [4] — Bitbucket
The Python "fastavro" library cannot correctly render "bytes" fields. This is a bug, and the maintainer is acting in a really crappy manner in this thread. Avoid this library
fastavro  fail  bugs  utf-8  bytes  encoding  asshats  open-source  python 
march 2015 by jm
A dive into a UTF-8 validation regexp
Once again, I find myself checking over the UTF-8 validation code in websocket-driver, and once again I find I cannot ever remember how to make sense of this regex that performs the validation. I just copied it off a webpage once and it took a while (and reimplementing UTF-8 myself) to fully understand what it does. If you write software that processes text, you’ll probably need to understand this too.
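
For reference, a regex of this kind can be sketched in Python as below. This is not the expression from the linked post or from websocket-driver (which is JavaScript); it is written from the well-formed byte ranges in the Unicode Standard, and the names `WELL_FORMED_UTF8` and `is_valid_utf8` are made up for this example.

```python
import re

# Each alternative matches one well-formed UTF-8 sequence.
WELL_FORMED_UTF8 = re.compile(
    rb"""(?:
        [\x00-\x7F]                        # 1 byte: ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2 bytes; C0 and C1 would be overlong
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes; second byte range excludes overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes; general case
      | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes; excludes surrogates D800-DFFF
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes; second byte range excludes overlongs
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes; general case
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes; nothing above U+10FFFF
    )*""",
    re.VERBOSE,
)


def is_valid_utf8(data):
    """Return True if the bytes object is entirely well-formed UTF-8."""
    return WELL_FORMED_UTF8.fullmatch(data) is not None


print(is_valid_utf8("h\u00e9llo \u20ac".encode("utf-8")))
print(is_valid_utf8(b"\xc0\xaf"))      # overlong encoding of "/"
print(is_valid_utf8(b"\xed\xa0\x80"))  # encoded surrogate
```

In Python itself, `data.decode("utf-8")` in a try/except does the same job; the regex form is mainly useful for seeing which byte ranges are legal at each position.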
utf-8  unicode  utf8  javascript  node  encoding  text  strings  validation  websockets  regular-expressions  regexps 
june 2014 by jm
Simple Binary Encoding
an OSI layer 6 presentation for encoding/decoding messages in binary format to support low-latency applications. [...] SBE follows a number of design principles to achieve this goal. Adhering to these design principles sometimes means that features available in other codecs will not be offered. For example, many codecs allow strings to be encoded at any field position in a message; SBE only allows variable length fields, such as strings, as fields grouped at the end of a message.

The SBE reference implementation consists of a compiler that takes a message schema as input and then generates language specific stubs. The stubs are used to directly encode and decode messages from buffers. The SBE tool can also generate a binary representation of the schema that can be used for the on-the-fly decoding of messages in a dynamic environment, such as for a log viewer or network sniffer.

The design principles drive the implementation of a codec that ensures messages are streamed through memory without backtracking, copying, or unnecessary allocation. Memory access patterns should not be underestimated in the design of a high-performance application. Low-latency systems in any language especially need to consider all allocation to avoid the resulting issues in reclamation. This applies for both managed runtime and native languages. SBE is totally allocation free in all three language implementations.

The end result of applying these design principles is a codec that has ~25X greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.
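
The layout idea (fixed-size fields at known offsets, variable-length data only at the end, written front to back into a preallocated buffer) can be sketched with Python's `struct` module. This is only an illustration of the layout: it is not SBE's wire format or its generated stubs, the message fields are hypothetical, and Python itself allocates objects, so none of the latency claims carry over.

```python
import struct

# Hypothetical fixed block: instrument id (uint32), price in 1/10000 units (int64),
# quantity (uint32), side (uint8). "<" means little-endian with no padding.
FIXED_BLOCK = struct.Struct("<IqIB")
LENGTH_PREFIX = struct.Struct("<H")  # uint16 length of the trailing variable field


def encode_trade(buffer, offset, instrument_id, price, quantity, side, note):
    """Write one message into buffer at offset; return the offset just past it."""
    FIXED_BLOCK.pack_into(buffer, offset, instrument_id, price, quantity, side)
    offset += FIXED_BLOCK.size
    note_bytes = note.encode("utf-8")
    LENGTH_PREFIX.pack_into(buffer, offset, len(note_bytes))
    offset += LENGTH_PREFIX.size
    buffer[offset:offset + len(note_bytes)] = note_bytes  # variable field goes last
    return offset + len(note_bytes)


def decode_trade(buffer, offset):
    """Read one message starting at offset; return (fields, next_offset)."""
    instrument_id, price, quantity, side = FIXED_BLOCK.unpack_from(buffer, offset)
    offset += FIXED_BLOCK.size
    (note_length,) = LENGTH_PREFIX.unpack_from(buffer, offset)
    offset += LENGTH_PREFIX.size
    note = bytes(buffer[offset:offset + note_length]).decode("utf-8")
    return (instrument_id, price, quantity, side, note), offset + note_length


buffer = bytearray(64)  # preallocated once and reused
end = encode_trade(buffer, 0, 42, 1234500, 100, 1, "fill")
print(decode_trade(buffer, 0), end)
```

Because every fixed field sits at a constant offset, a reader can pull out a single field without parsing the rest of the message; the trailing string is the only part whose position depends on the data.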
sbe  encoding  protobuf  protocol-buffers  json  messages  messaging  binary  formats  low-latency  martin-thompson  xml 
may 2014 by jm
Simple Binary Encoding
'SBE is an OSI layer 6 representation for encoding and decoding application messages in binary format for low-latency applications.'

Licensed under ASL2, C++ and Java supported.
sbe  encoding  codecs  persistence  binary  low-latency  open-source  java  c++  serialization 
december 2013 by jm
Transloadit
AWS-based service to resize images, encode video files, extract thumbnails, and store to S3, for use by third-party web apps. Transcoding-as-a-service
encoding  images  s3  media  storage  transcoding  video  converter  fileupload  from delicious
july 2010 by jm
SimpleRip: Ripping/Encoding DVDs to Xvid with Mencoder
good idea -- generate a mencoder command-line using a friendlier Javascript single-page UI (via OMGUbuntu)
via:omgubuntu  avi  mplayer  conversion  divx  encoding  howto  rip  xvid  video  mencoder  from delicious
june 2010 by jm
