Get angry again (Unicode edition)

bri hefele, 2018-10-23, communication, trans

So, it’s a bit of a recurring theme that this administration makes some horrifying attack on some marginalized group and I feel the need to make some brief post here angrily tossing out organizations worth donating to. Of course, the topic this week is a series of actions threatening trans people¹ and hearkening back to the 1933 burning of the archives of the Institut für Sexualwissenschaft. I’m personally feeling less and less in control of how I’m handling the erosion of civil liberties, and part of me right now needs to write, beyond a brief scream into the ether. So here’s what this post is: if anything on this site has ever had any value to you, please just roll 1D10 and donate to:

…and with that out of the way, for the sake of my own mental health, I’m going to quasi-continue my last post with a bit of binary-level explanation of text file encodings, with emphasis on the Unicode Transformation Formats (UTFs).

⚧ rights are 👤 rights!

…is a topical message made succinct via the vast character repertoire of Unicode. Note that if the above looks like ‘� rights are � rights!’, the first potentially unsupported character should be the transgender symbol and the second should be the human bust in silhouette emoji. These are Unicode code points 26A7 and 1F464, respectively. This is important: every other character falls under the scope of ASCII and therefore requires only a single byte. The transgender symbol requires two bytes, and the emoji requires three. So let’s see how this plays out.

All the sample hex dumps that follow were output from xxd, which uses a period (.) in the (right-hand side) ASCII display to represent non-ASCII bytes. In the text encodings that don’t support two- or three-byte code points, I have replaced these with an asterisk (*, hex 2A) prior to writing/dumping. ASCII is one such encoding – it supports neither character. So, let’s take a look at our string, ‘* rights are * rights!’:

ASCII
00000000: 2A 20 72 69 67 68 74 73 20 61 72 65 20 2A        * rights are *
0000000e: 20 72 69 67 68 74 73 21 0A                        rights!.

Presumably this is obvious, but ASCII has a very limited character repertoire. In reality a 7-bit encoding, ASCII at least had the very important role of being an early standardized encoding, which was great! Before ASCII², any given system’s text encoding was likely incompatible with any other’s. This kind of fell apart when localizations required larger character repertoires, and the eighth bit was used for any number of Extended ASCII encodings. Because ASCII and a number of Extended ASCII encodings standardized under ISO 8859³ were so widely used, and are still so widely used, backward-compatibility remains important. In a very loose sense, Unicode could be seen as an extension onto ASCII – the first (U0000) section of code is ASCII exactly. So, ASCII is limited by 7-bits, various Extended ASCIIs are limited to one byte, what does our byte stream look like if we open this up to two bytes per character?

The Universal Coded Character Set

UCS-2
00000000: 26 A7 00 20 00 72 00 69 00 67 00 68 00 74        &.. .r.i.g.h.t
0000000e: 00 73 00 20 00 61 00 72 00 65 00 20 00 2A        .s. .a.r.e. .*
0000001c: 00 20 00 72 00 69 00 67 00 68 00 74 00 73        . .r.i.g.h.t.s
0000002a: 00 21 00 0A                                      .!..

UCS-2 is about the most straightforward way to expand the character repertoire to 65,355 characters. Every single character is given two bytes, which means suddenly we can use our transgender symbol (26 A7), and all of our ASCII symbols now essentially have a null byte in front of them (00 72 for a lowercase r). There are a lot of 00s in that stream. xxd shows us an ampersand toward the beginning, since 26 is the ASCII code point for &. xxd throws up dots for all the null bytes. Unicode 11.0’s repertoire contains 137,439 characters, a number greater than 65,355. Our emoji, as mentioned, sits at code point 1F464, beyond the FFFF supported by UCS-2 (and therefore replaced with an asterisk above). We can, however, encode the whole string with UCS-4:

UCS-4
00000000: 00 00 26 A7 00 00 00 20 00 00 00 72 00 00        ..&.... ...r..
0000000e: 00 69 00 00 00 67 00 00 00 68 00 00 00 74        .i...g...h...t
0000001c: 00 00 00 73 00 00 00 20 00 00 00 61 00 00        ...s... ...a..
0000002a: 00 72 00 00 00 65 00 00 00 20 00 01 F4 64        .r...e... ...d
00000038: 00 00 00 20 00 00 00 72 00 00 00 69 00 00        ... ...r...i..
00000046: 00 67 00 00 00 68 00 00 00 74 00 00 00 73        .g...h...t...s
00000054: 00 00 00 21 00 00 00 0A                          ...!....

…even more 00s, as every character now gets four bytes. Our transgender symbol lives on as 00 00 26 A7, our ASCII characters have three null bytes (00 00 00 72), and we can finally encode our emoji: 00 01 F4 64. You’ll see an errant d in the ASCII column, that’s xxd picking up on the 64 byte from the emoji. These two- and four-byte versions of the Universal Coded Character Set (UCS) are very straightforward, but not very efficient. If you think you might need to use characters above the FFFF range, suddenly every character you type requires four bytes – if this was for the sake of a single character, your filesize could nearly double. It could nearly quadruple if the majority of your file was characters from ASCII. So the better way to handle this is with the Unicode Transformation Formats (UTFs).

The Unicode Transformation Formats

UTF-8
00000000: E2 9A A7 20 72 69 67 68 74 73 20 61 72 65        ... rights are
0000000e: 20 F0 9F 91 A4 20 72 69 67 68 74 73 21 0A         .... rights!.

UTF-8 is essentially the standard text encoding these days. Both the World Wide Web Consortium and the Internet Mail Consortium recommend UTF-8 as the standard encoding. It starts with the 7-bit ASCII set, and starts setting high bits for multi-byte characters. In a multi-byte character, the first byte starts with binary 110, 1110, or 11110, depending on how many bytes follow (one, two, or three, respectively). These bytes all begin with 10. Our transgender symbol requires three bytes: E2 9A A7. The A7 is familiar as the end of the codepoint, 26A7, but the first two bytes are not recognizable because of the above scheme.

If we break 26A7 into 4-bit binary words, we get…

2    6    A    7
0010 0110 1010 0111

…and E29AA7 is…

E    2    9    A    A    7
1110 0010 1001 1010 1010 0111

…E is our 1110 that signifies that the next two bytes are part of the same character. The next four bits are the beginning of our character, the 2 or 0010. The two following bytes are made up of two 10 bits, and six bits of code point information, so effectively our 26A7 is actually broken up like…

2    6/A…   …A/7
0010 011010 10011

…and we see that in reality, it was mere coincidence that our three-byte version ended in A7. The 7 is a given, but the A happened by chance. UTF-8 is a great format as far as being mindful of size is concerned, but it’s less than ideal for a user who needs to examine a document at the byte level. While code point 26A7 will always translate to E29AA7, a whole second mapping is needed, and the variable byte size per character means that a hex editor’s word size can’t be set to correspond directly to a character. At least it’s fairly easy to suss out at the binary level. UTF-16 looks like:

UTF-16
00000000: 26 A7 00 20 00 72 00 69 00 67 00 68 00 74        &.. .r.i.g.h.t
0000000e: 00 73 00 20 00 61 00 72 00 65 00 20 D8 3D        .s. .a.r.e. .=
0000001c: DC 64 00 20 00 72 00 69 00 67 00 68 00 74        .d. .r.i.g.h.t
0000002a: 00 73 00 21 00 0A                                .s.!..

UTF-16 is used internally at the OS level a lot, and fortunately doesn’t really make its way to end-users much. We can see that our transgender symbol, 26 A7 comes out unscathed since it takes only two bytes. Our emoji shows up as D8 3D DC 64, and the way we get there is very convoluted. First, UTF-16 asks that we subtract (hex) 10000 from our code point, giving us F464. We pad this so that it’s twenty bits long, and break it into two ten-bit words. We then add hex D800 to the first and DC00 to the second:

Original: F4         64
Ten-bit:  0000111101 0001100100
Hex:      003D       0064
Plus:     D800       DC00
Equals:   D83D       DC64

This has the same human-readability issues as UTF-8, and wastes a lot of bytes in the process. Next up would be UTF-32, but seeing as that puts us in four-bytes-per-character territory… It is functionally identical to UCS-4 above⁴.

In closing

All of this information is readily available elsewhere, notably in Chapter 2, Section 5 of The Unicode Standard. I haven’t seen a great side-by-side comparison of UCS and UTF formats at the byte level before, with a focus on how binary data lines up with Unicode code points. UTF-8 is the ‘gold standard’ for good reason – it allows the entire character repertoire to be represented while requiring the least amount of data. However, there are times when it’s necessary to examine text at the binary level, and for a human, this is much easier accomplished by reëncoding the text as UCS-4/UTF-32 and setting a 32-bit word size in your hex editor.

If you’ve made it this far into a post about turning numbers into letters, I have one more thing to say… Please get out and vote, eligible American citizens. Our civil liberties are at stake.

When I started writing this post, there was ‘just’ the leaked memo, and the longer I took, the more attacks they piled on. It’s beyond cruel. ↩︎
There were other standards before ASCII, notably IBM’s EBCDIC. This was certainly not as widely-supported as ASCII, nor is it widely used today. ↩︎
The ISO 8859 standards are replicated under a number of publicly-available ECMA standards as well: ECMA-94, ECMA-113, ECMA-114, ECMA-118, ECMA-121, ECMA-128, and ECMA-144. ↩︎
I think historically there have been some differences between UCS-4 and UTF-32, like UTF-32 being stricter about code point boundaries. However, Appendix C, Section 2 of The Unicode Standard states that “[UCS-4] is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in [ISO/IEC] 10646.” ↩︎