It’s Not Wrong that "🤦🏼‍♂️".length == 7 But It’s Better that "🤦🏼‍♂️".len() == 17 and Rather Useless that len("🤦🏼‍♂️") == 5


From time to time, someone shows that in JavaScript the .length of a string containing an emoji results in a number greater than 1 (typically 2) and then proceeds to the conclusion that haha JavaScript is so broken—and is rewarded with many likes. In this post, I will try to convince you that ridiculing JavaScript for this is less insightful than it first appears and that Swift’s approach to string length isn’t unambiguously the best one. Python 3’s approach is unambiguously the worst one, though.

What’s Going on with the Title?
"🤦🏼‍♂️".length == 7 evaluates to true as JavaScript (or Java). Let’s try JavaScript console in Firefox:

"🤦🏼‍♂️".length == 7
true
Haha, right? Well, you’ve been told that the Python community suffered the Python 2 vs. Python 3 split, among other things, to Get Unicode Right. Let’s try Python 3:

$ python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len("🤦🏼‍♂️") == 5
True
>>>
OK, then. Now, Rust has the benefit of learning from languages that came before it. Let’s try Rust:

$ cargo new -q length
$ cd length
$ echo 'fn main() { println!("{}", "🤦🏼‍♂️".len() == 17); }' > src/main.rs
$ cargo run -q
true
That’s better!

What?
The string contains a single emoji consisting of five Unicode scalar values:

                                           UTF-32      UTF-16      UTF-8       UTF-32  UTF-16  UTF-8
                                           code units  code units  code units  bytes   bytes   bytes
U+1F926 FACE PALM                          1           2           4           4       4       4
U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3  1           2           4           4       4       4
U+200D ZERO WIDTH JOINER                   1           1           3           4       2       3
U+2642 MALE SIGN                           1           1           3           4       2       3
U+FE0F VARIATION SELECTOR-16               1           1           3           4       2       3
Total                                      5           7           17          20      14      17
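
Each encoding’s byte count in the table is just its code-unit count multiplied by the code-unit width: four bytes per UTF-32 code unit, two per UTF-16 code unit, one per UTF-8 code unit. A minimal Rust sketch of that arithmetic:

fn main() {
    let s = "🤦🏼‍♂️";
    let utf32_units = s.chars().count();        // Unicode scalar values
    let utf16_units = s.encode_utf16().count(); // UTF-16 code units
    let utf8_units = s.len();                   // UTF-8 code units
    assert_eq!((utf32_units, utf16_units, utf8_units), (5, 7, 17));
    assert_eq!(utf32_units * 4, 20); // UTF-32 bytes
    assert_eq!(utf16_units * 2, 14); // UTF-16 bytes
    assert_eq!(utf8_units, 17);      // UTF-8 bytes: one byte per code unit
}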

The string that contains one graphical unit consists of 5 Unicode scalar values. First, there’s a base character that means a person face palming. By default, the person would have a cartoonish yellow color. The next character is an emoji skin tone modifier that changes the color of the person’s skin (and, in practice, also the color of the person’s hair). By default, the gender of the person is undefined; Apple, for example, defaults to what they consider a male appearance, while Google defaults to what they consider a female appearance. The next two scalar values pick a male-typical appearance specifically, regardless of font and vendor. Instead of being an emoji-specific modifier like the skin tone, the gender specification uses an emoji-predating gender symbol (MALE SIGN) explicitly ligated, using the ZERO WIDTH JOINER, with the (skin-toned) face-palming person. (Whether it is a good or a bad idea that the skin tone and gender specifications use different mechanisms is outside the scope of this post.) Finally, VARIATION SELECTOR-16 makes it explicit that we want a multicolor emoji rendering instead of a monochrome dingbat rendering.
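
To make the decomposition concrete, here is a minimal Rust sketch that assembles the string from the five scalar values listed above and checks that it equals the emoji literal:

fn main() {
    let composed: String = [
        '\u{1F926}', // FACE PALM
        '\u{1F3FC}', // EMOJI MODIFIER FITZPATRICK TYPE-3
        '\u{200D}',  // ZERO WIDTH JOINER
        '\u{2642}',  // MALE SIGN
        '\u{FE0F}',  // VARIATION SELECTOR-16
    ]
    .iter()
    .collect();
    assert_eq!(composed, "🤦🏼‍♂️");
}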

Each of the languages above reports the string length as the number of code units that the string occupies. Python 3 strings have (guaranteed-valid) UTF-32 semantics, so the string occupies 5 code units. In UTF-32, each Unicode scalar value occupies one code unit. JavaScript (and Java) strings have (potentially-invalid) UTF-16 semantics, so the string occupies 7 code units. Rust strings are (guaranteed-valid) UTF-8, so the string occupies 17 code units. It is intentional that the phrasing for the Rust case differs from the phrasing for the Python and JavaScript cases. We’ll come back to that later.
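
As a taste of the valid/invalid distinction: 0xD83E is the high surrogate that begins the UTF-16 form of U+1F926. A JavaScript string can contain it unpaired, but a surrogate is not a Unicode scalar value, so Rust’s char (and therefore str) cannot hold it. A minimal sketch:

fn main() {
    // 0xD83E is a surrogate code point, not a Unicode scalar value,
    // so there is no char for it:
    assert!(std::char::from_u32(0xD83E).is_none());
    // U+1F926 itself is a scalar value and works fine:
    assert!(std::char::from_u32(0x1F926).is_some());
}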

But I Want the Length to Be 1!
There’s a language for that. The following uses Swift 4.2.3, which was the latest release when I researched this, on Ubuntu 18.04:

$ mkdir swiftlen
$ cd swiftlen/
$ swift package init --type executable
Creating executable package: swiftlen
Creating Package.swift
Creating README.md
Creating .gitignore
Creating Sources/
Creating Sources/swiftlen/main.swift
Creating Tests/
Creating Tests/LinuxMain.swift
Creating Tests/swiftlenTests/
Creating Tests/swiftlenTests/swiftlenTests.swift
Creating Tests/swiftlenTests/XCTestManifests.swift
$ echo 'print("🤦🏼‍♂️".count == 1)' > Sources/swiftlen/main.swift
$ swift run swiftlen 2>/dev/null
true
(Not using the Swift REPL for the example, because it does not appear to accept non-ASCII input on Ubuntu! Swift 5.0.3 prints the same and the REPL is still broken.)

OK, so we’ve found a language that thinks the string contains one countable unit. But what is that countable unit? It’s an extended grapheme cluster. (“Extended” to distinguish from the older attempt at defining grapheme clusters now called legacy grapheme clusters.) The definition is in Unicode Standard Annex #29 (UAX #29).
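
For a non-emoji illustration of the concept: "é" written as U+0065 followed by U+0301 COMBINING ACUTE ACCENT is two scalar values but one extended grapheme cluster. A minimal Rust sketch using the unicode-segmentation crate (which appears again below):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "e\u{0301}"; // 'e' followed by COMBINING ACUTE ACCENT
    assert_eq!(s.chars().count(), 2);         // two Unicode scalar values
    assert_eq!(s.graphemes(true).count(), 1); // one extended grapheme cluster
}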

The Lengths Seen So Far
We’ve seen four different lengths so far:

Number of UTF-8 code units (17 in this case)
Number of UTF-16 code units (7 in this case)
Number of UTF-32 code units or Unicode scalar values (5 in this case)
Number of extended grapheme clusters (1 in this case)
Given a valid Unicode string and a version of Unicode, all of the above are well-defined, and each item higher on the list is greater than or equal to the items lower on the list.
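
For the string at hand, the ordering can be spot-checked in Rust (using the unicode-segmentation crate for the grapheme count; as we’ll see below, its count for this particular string reflects older segmentation rules, but the inequality still holds):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "🤦🏼‍♂️";
    let utf8 = s.len();
    let utf16 = s.encode_utf16().count();
    let scalars = s.chars().count();
    let graphemes = s.graphemes(true).count();
    // Each count is greater than or equal to the next one down the list.
    assert!(utf8 >= utf16 && utf16 >= scalars && scalars >= graphemes);
}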

More Than One Length per Programming Language
It is not the case that a given programming language has to choose only one of the above. If we run this Swift program:

var s = "🤦🏼‍♂️"
print(s.count)
print(s.unicodeScalars.count)
print(s.utf16.count)
print(s.utf8.count)
it prints:

1
5
7
17
Let’s try Rust with unicode-segmentation = "1.3.0" in Cargo.toml:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "🤦🏼‍♂️";
    println!("{}", s.graphemes(true).count());
    println!("{}", s.chars().count());
    println!("{}", s.encode_utf16().count());
    println!("{}", s.len());
}
The above program prints:

2
5
7
17
That’s unexpected! It turns out that unicode-segmentation does not implement the latest version of the Unicode segmentation rules, so it gives the ZERO WIDTH JOINER generic treatment (break right after ZWJ) instead of the newer refinement in the emoji context.
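
To see where the older rules place the break, a small diagnostic sketch that prints the scalar values of each cluster the iterator reports:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "🤦🏼‍♂️";
    for cluster in s.graphemes(true) {
        // Print each reported cluster as its scalar values.
        println!("{:?}", cluster.chars().collect::<Vec<char>>());
    }
}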

Let’s try again, but this time with unic-segment = "0.9.0" in Cargo.toml:

use unic_segment::Graphemes;

fn main() {
    let s = "🤦🏼‍♂️";
    println!("{}", Graphemes::new(s).count());
    println!("{}", s.chars().count());
    println!("{}", s.encode_utf16().count());
    println!("{}", s.len());
}
The above program prints:

1
5
7
17
In the Rust case, strings (here, mere string slices) know the number of UTF-8 code units they contain. The len() method just returns this number, which has been stored since the creation of the string (in this case, at compile time). In the other cases, what happens is the creation of an iterator; instead of actually examining the values that the iterator would yield (string slices corresponding to extended grapheme clusters, Unicode scalar values, or UTF-16 code units), the count() method merely consumes the iterator and returns the number of items that were yielded. The count isn’t stored anywhere on the string (slice) afterwards. If we wanted to know the counts again later, we’d have to iterate over the string again.
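
In other words, only the UTF-8 length comes for free; the other counts cost a full pass over the string every time they are requested. A minimal sketch of the distinction:

fn main() {
    let s = "🤦🏼‍♂️";
    // O(1): the UTF-8 byte length is stored with the string slice itself.
    let bytes = s.len();
    // O(n): chars() creates a fresh iterator and count() walks it to the end;
    // the result is not cached on the slice.
    let scalars = s.chars().count();
    println!("{} bytes, {} scalar values", bytes, scalars);
}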