jm + unicode   16

A Programmer’s Introduction to Unicode – Nathan Reed’s coding blog
Fascinating Unicode details -- a lot of which were new to me. Love the heat map of usage in Wikipedia:
One more interesting way to visualize the codespace is to look at the distribution of usage—in other words, how often each code point is actually used in real-world texts. Below is a heat map of planes 0–2 based on a large sample of text from Wikipedia and Twitter (all languages). Frequency increases from black (never seen) through red and yellow to white.

You can see that the vast majority of this text sample lies in the BMP, with only scattered usage of code points from planes 1–2. The biggest exception is emoji, which show up here as the several bright squares in the bottom row of plane 1.
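For reference, the plane a code point belongs to is just its value shifted right by 16 bits, so the plane split described above can be reproduced in a few lines. This is a minimal sketch only; `plane_of` and `plane_histogram` are hypothetical helper names, not anything from the linked article.

```python
from collections import Counter

def plane_of(ch: str) -> int:
    """Return the Unicode plane (0-16) of a single character."""
    # Each plane holds 0x10000 code points, so the plane is the value >> 16.
    return ord(ch) >> 16

def plane_histogram(text: str) -> Counter:
    """Count how many characters of `text` fall in each plane."""
    return Counter(plane_of(ch) for ch in text)

# Plane 0 is the BMP; most emoji live in plane 1.
sample = "na\u00efve \U0001F4A9"
histogram = plane_histogram(sample)
```

Running this over a real corpus would give one number per plane rather than the per-code-point heat map in the article.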
unicode  coding  character-sets  wikipedia  bmp  emoji  twitter  languages  characters  heat-maps  dataviz 
17 days ago by jm
Google and Monotype launch Noto, an open-source typeface family for all the world’s languages
Great font factoid: 'The name “Noto” comes from the little squares that show when a font is not supported by a computer. These are often referred to as “tofu”, because of their shape, therefore the font is short for No Tofu.'
tofu  fonts  i18n  google  design  typography  unicode 
october 2016 by jm
Dark corners of Unicode
I’m assuming, if you are on the Internet and reading kind of a nerdy blog, that you know what Unicode is. At the very least, you have a very general understanding of it — maybe “it’s what gives us emoji”.

That’s about as far as most people’s understanding extends, in my experience, even among programmers. And that’s a tragedy, because Unicode has a lot of… ah, depth to it. Not to say that Unicode is a terrible disaster — more that human language is a terrible disaster, and anything with the lofty goals of representing all of it is going to have some wrinkles.

So here is a collection of curiosities I’ve encountered in dealing with Unicode that you generally only find out about through experience. Enjoy.
unicode  characters  encoding  emoji  utf-8  utf-16  utf  mysql  text 
september 2015 by jm
Late to this one -- a nice list of bad input (Unicode zero-width spaces, etc) for testing
testing  strings  text  data  unicode  utf-8  tests  input  corrupt 
august 2015 by jm
iPhone UTF-8 text vulnerability
'Due to how the banner notifications process the Unicode text. The banner briefly attempts to present the incoming text and then "gives up" thus the crash'. Apparently the entire Springboard launcher crashes.
apple  vulnerability  iphone  utf-8  unicode  fail  bugs  springboard  ios  via:abetson 
may 2015 by jm
attacks using U+202E - RIGHT-TO-LEFT OVERRIDE
Security implications of in-band signalling strikes again, 43 years after the "Blue Box" hit the mainstream.

Jamie McCarthy on Twitter: ".@cmdrtaco - Remember when we had to block the U+202E code point in Slashdot comments to stop siht ekil stnemmoc?"

See also -- GMail was vulnerable too; and for more inline control chars. has some official recommendations from the Unicode consortium on dealing with bidi override chars.
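A rough sketch of the Slashdot-style fix, assuming the policy is simply to drop the explicit bidi embedding, override and isolate controls (U+202A to U+202E and U+2066 to U+2069) from untrusted input. That policy is my assumption, not the consortium's recommendation, and `strip_bidi_controls` is a hypothetical name:

```python
import re

# Explicit bidi formatting characters:
#   U+202A..U+202E  LRE, RLE, PDF, LRO, RLO
#   U+2066..U+2069  LRI, RLI, FSI, PDI
BIDI_CONTROLS = re.compile("[\u202A-\u202E\u2066-\u2069]")

def has_bidi_controls(text: str) -> bool:
    """True if the text contains any explicit bidi control character."""
    return BIDI_CONTROLS.search(text) is not None

def strip_bidi_controls(text: str) -> str:
    """Remove explicit bidi controls; all other characters are left alone."""
    return BIDI_CONTROLS.sub("", text)

comment = "comments like \u202Ethis"
cleaned = strip_bidi_controls(comment)
```

Stripping outright can mangle legitimate right-to-left text that relies on these characters, so rejecting or flagging the input may be the better choice in some applications.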
security  attacks  rlo  unicode  control-characters  codepoints  bidi  text  gmail  slashdot  sanitization  input 
april 2015 by jm
A dive into a UTF-8 validation regexp
Once again, I find myself checking over the UTF-8 validation code in websocket-driver, and once again I find I cannot ever remember how to make sense of this regex that performs the validation. I just copied it off a webpage once and it took a while (and reimplementing UTF-8 myself) to fully understand what it does. If you write software that processes text, you’ll probably need to understand this too.
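For comparison, here is a hand-rolled validator following the well-formed byte ranges from RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF). It is a sketch of the same rules a validation regex has to encode, not the regex from the linked post, and `is_valid_utf8` is a hypothetical name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check that `data` is well-formed UTF-8 per RFC 3629."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                      # ASCII, single byte
            i += 1
            continue
        # For each lead byte: number of continuation bytes, plus the
        # allowed range of the FIRST continuation byte.
        if 0xC2 <= b <= 0xDF:              # 2-byte (C0/C1 would be overlong)
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:                    # 3-byte, exclude overlong forms
            need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:                    # 3-byte, exclude surrogates
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF:            # other 3-byte lead bytes
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:                    # 4-byte, exclude overlong forms
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:                    # 4-byte, cap at U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:                              # stray continuation or invalid lead
            return False
        if i + need >= n:                  # sequence truncated
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):       # remaining continuation bytes
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

In practice `bytes.decode("utf-8")` in strict mode enforces the same rules; writing it out by hand is mainly useful for seeing where each range comes from.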
utf-8  unicode  utf8  javascript  node  encoding  text  strings  validation  websockets  regular-expressions  regexps 
june 2014 by jm
Unchi-kun Candy - Japanese Lucky Poop Candy
What doesn't look like Christmas more than a smiling piece of poop, called unchi in Japanese? Because the shape of unchi looks similar to that of mochi used for shrine offerings, and because the sound "unchi" is like the Japanese word for luck, this treat is actually a lucky gift -- at least that is how you can explain yourself when you give it as a gift. Each Unchi-kun comes packed with poop candy, taken out from the bottom. Once finished eating, you can open the slot in the back with a box-cutter and turn it into a bank.

Want one!
unchi-kun  unchi  pile-of-poo  emoji  unicode  cute  funny  japan  j-list  sweets  food  gross  candy 
may 2014 by jm
Shapecatcher: Draw the Unicode character you want!
'This is a tool to help you find Unicode characters. Finding a specific character whose name you don't know is cumbersome. On, all you need to know is the shape of the character!' Handy.
shapes  drawing  unicode  characters  language  recognition  web 
may 2014 by jm
Fake Unicode Consortium
featuring such codepoints as "I USED TO BE A LATIN CAPITAL LETTER K LIKE YOU THEN I TOOK AN ARROW IN THE KNEE", "BACK TO THE FUTURE", "ENTERING HYPERSPACE", "LATIN CAPITAL LETTER Q TAKING A NAP", and "LOVE HOTEL". no wait, that one's real (via Tony Finch, with comments by Michael Everson!)
unicode  humor  codepoints  i18n  fonts  skyrim  hyperspace  funny  via:fanf 
march 2012 by jm
'We are proud to announce the free and open-source, a new method of input for over 100 languages that uses statistical reasoning so that users can type effortlessly in plain ASCII while ultimately producing accurate text. This allows Vietnamese users, for example, to simply type “Moi nguoi deu co quyen tu do ngon luan va bay to quan diem,” which will be automatically corrected to “Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm” after Accentuation. To date, we support four clients: Mozilla Firefox, Perl, Python, and Vim, with more to be added shortly.' cool
accents  language  web-services  typing  text-entry  ascii  unicode  characters  from delicious
december 2010 by jm
UTS #46: Unicode IDNA Compatibility Processing
'Client software, such as browsers and emailers, faces a difficult transition from the version of international domain names approved in 2003 (IDNA2003), to the revision approved in 2010 (IDNA2008). The specification in this document provides a mechanism that minimizes the impact of this transition for client software, allowing client software to access domains that are valid under either system.' wow, this is hairy stuff
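To see what the transition is about, Python's built-in `idna` codec is handy, since it implements the older IDNA2003 rules. A small sketch; the domain names are just illustrative examples, not taken from the spec text quoted above:

```python
# Python's standard "idna" codec follows IDNA2003 (RFC 3490).
# Non-ASCII labels are converted to Punycode with an "xn--" prefix.
punycode = "b\u00fccher.example".encode("idna")

# Under IDNA2003, German sharp s (U+00DF) is mapped to "ss" before
# encoding, so the label comes out as plain ASCII.
mapped = "fa\u00df.de".encode("idna")

# Decoding goes back from the ASCII form to Unicode.
roundtrip = punycode.decode("idna")
```

Under IDNA2008 the sharp s is a valid character in its own right rather than being mapped to "ss", which is one of the cases where the two versions disagree about what a name means.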
idn  unicode  domains  interop  from delicious
october 2010 by jm
Unicode 6.0 released
including PILE OF POO, at code point U+1F4A9.
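Since U+1F4A9 sits outside the BMP, it takes four bytes in UTF-8 and a surrogate pair in UTF-16. A quick sketch to check that by hand (`surrogate_pair` is a hypothetical helper name):

```python
import unicodedata

poo = chr(0x1F4A9)

def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

name = unicodedata.name(poo)               # official character name
utf8_bytes = poo.encode("utf-8")           # 4 bytes
utf16_units = surrogate_pair(0x1F4A9)      # 2 code units
```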
pile-of-poo  poo  unicode  funny  emoji  characters  from delicious
october 2010 by jm
Emoji Symbols in Unicode: PILE OF POO
"unchi" / "unchimaaku", a little Emoji icon of a dog turd. unfortunately still in "proposed" status, not yet a Unicode code point, boo
poo  funny  emoji  unicode  unchi  shit  from delicious
june 2010 by jm
