jm + utf-8   8

glibc changed their UTF-8 character collation ordering across versions, breaking postgres
This is terrifying:
Streaming replicas—and by extension, base backups—can become dangerously broken when the source and target machines run slightly different versions of glibc. Particularly, differences in strcoll and strcoll_l leave "corrupt" indexes on the slave. These indexes are sorted out of order with respect to the strcoll running on the slave. Because postgres is unaware of the discrepancy is uses these "corrupt" indexes to perform merge joins; merges rely heavily on the assumption that the indexes are sorted and this causes all the results of the join past the first poison pill entry to not be returned. Additionally, if the slave becomes master, the "corrupt" indexes will in cases be unable to enforce uniqueness, but quietly allow duplicate values.


Moral of the story -- keep your libc versions in sync across storage replication sets!
postgresql  scary  ops  glibc  collation  utf-8  characters  indexing  sorting  replicas  postgres 
6 weeks ago by jm
Booking.com, MySQL and UTF-8
good preso from Percona Live 2015 on the messiness of MySQL vs UTF-8 and utf8mb4
utf-8  utf8mb4  mysql  storage  databases  slides  booking.com  character-sets 
december 2016 by jm
Dark corners of Unicode
I’m assuming, if you are on the Internet and reading kind of a nerdy blog, that you know what Unicode is. At the very least, you have a very general understanding of it — maybe “it’s what gives us emoji”.

That’s about as far as most people’s understanding extends, in my experience, even among programmers. And that’s a tragedy, because Unicode has a lot of… ah, depth to it. Not to say that Unicode is a terrible disaster — more that human language is a terrible disaster, and anything with the lofty goals of representing all of it is going to have some wrinkles.

So here is a collection of curiosities I’ve encountered in dealing with Unicode that you generally only find out about through experience. Enjoy.
unicode  characters  encoding  emoji  utf-8  utf-16  utf  mysql  text 
september 2015 by jm
minimaxir/big-list-of-naughty-strings
Late to this one -- a nice list of bad input (Unicode zero-width spaces, etc) for testing
testing  strings  text  data  unicode  utf-8  tests  input  corrupt 
august 2015 by jm
iPhone UTF-8 text vulnerability
'Due to how the banner notifications process the Unicode text. The banner briefly attempts to present the incoming text and then "gives up" thus the crash'. Apparently the entire Springboard launcher crashes.
apple  vulnerability  iphone  utf-8  unicode  fail  bugs  springboard  ios  via:abetson 
may 2015 by jm
Avro, mail # dev - bytes and fixed handling in Python implementation - 2014-09-04, 22:54
More Avro trouble with "bytes" fields! Avoid using "bytes" fields in Avro if you plan to interoperate with either of the Python implementations; they both fail to marshal them into JSON format correctly. This is the official "avro" library, which produces UTF-8 errors when a non-UTF-8 byte is encountered
bytes  avro  marshalling  fail  bugs  python  json  utf-8 
march 2015 by jm
tebeka / fastavro / issues / #11 - fastavro breaks dumping binary fixed [4] — Bitbucket
The Python "fastavro" library cannot correctly render "bytes" fields. This is a bug, and the maintainer is acting in a really crappy manner in this thread. Avoid this library
fastavro  fail  bugs  utf-8  bytes  encoding  asshats  open-source  python 
march 2015 by jm
A dive into a UTF-8 validation regexp
Once again, I find myself checking over the UTF-8 validation code in websocket-driver, and once again I find I cannot ever remember how to make sense of this regex that performs the validation. I just copied it off a webpage once and it took a while (and reimplementing UTF-8 myself) to fully understand what it does. If you write software that processes text, you’ll probably need to understand this too.
utf-8  unicode  utf8  javascript  node  encoding  text  strings  validation  websockets  regular-expressions  regexps 
june 2014 by jm

Copy this bookmark:



description:


tags: