jm + utf8   3

A Branchless UTF-8 Decoder
This week I took a crack at writing a branchless UTF-8 decoder: a function that decodes a single UTF-8 code point from a byte stream without any if statements, loops, short-circuit operators, or other sorts of conditional jumps. [...] Why branchless? Because high performance CPUs are pipelined. That is, a single instruction is executed over a series of stages, and many instructions are executed in overlapping time intervals, each at a different stage.


Neat hack (via Tony Finch)
algorithms  optimization  unicode  utf8  branchless  coding  c  via:fanf 
10 weeks ago by jm
A dive into a UTF-8 validation regexp
Once again, I find myself checking over the UTF-8 validation code in websocket-driver, and once again I find I cannot ever remember how to make sense of this regex that performs the validation. I just copied it off a webpage once and it took a while (and reimplementing UTF-8 myself) to fully understand what it does. If you write software that processes text, you’ll probably need to understand this too.
utf-8  unicode  utf8  javascript  node  encoding  text  strings  validation  websockets  regular-expressions  regexps 
june 2014 by jm
Vlnt
'A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on.' UTF8-ish compression, used in Avro
utf8  compression  utf  lucene  avro  hadoop  java  fomats  numeric  from delicious
november 2009 by jm

Copy this bookmark:



description:


tags: