So, it’s a bit of a recurring theme that this administration makes some horrifying attack on some marginalized group and I feel the need to make some brief post here angrily tossing out organizations worth donating to. Of course, the topic this week is a series of actions threatening trans people and hearkening back to the 1933 burning of the archives of the Institut für Sexualwissenschaft. I’m personally feeling less and less in control of how I’m handling the erosion of civil liberties, and part of me right now needs to write, beyond a brief scream into the ether. So here’s what this post is: if anything on this site has ever had any value to you, please just roll 1D10 and donate to:
- Trans Lifeline
- National Center for Transgender Equality
- Transgender Law Center
- Transgender Legal Defense & Education Fund
- Sylvia Rivera Law Project
- Trans Justice Funding Project
- Trans Women of Color Collective
- Trans Student Educational Resources
- Lambda Legal
- Southern Poverty Law Center
…and with that out of the way, for the sake of my own mental health, I’m going to quasi-continue my last post with a bit of binary-level explanation of text file encodings, with emphasis on the Unicode Transformation Formats (UTFs).
⚧ rights are 👤 rights!
…is a topical message made succinct via the vast character repertoire of Unicode. Note that if the above looks like ‘� rights are � rights!’, the first potentially unsupported character should be the transgender symbol and the second should be the human bust in silhouette emoji. These are Unicode code points 26A7 and 1F464, respectively. This is important: every other character falls under the scope of ASCII and therefore requires only a single byte. The transgender symbol requires two bytes, and the emoji requires three. So let’s see how this plays out.
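In Python, for instance (my choice of tool here, nothing inherent to Unicode), ord() confirms both code points and the minimum bytes each one needs:

```python
# Code points of the two non-ASCII characters in the message above.
trans = "\u26a7"        # transgender symbol
bust = "\U0001F464"     # bust in silhouette emoji

print(hex(ord(trans)))  # 0x26a7
print(hex(ord(bust)))   # 0x1f464

# Minimum bytes needed to store each code point as a bare integer:
print((ord(trans).bit_length() + 7) // 8)  # 2
print((ord(bust).bit_length() + 7) // 8)   # 3
```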
All the sample hex dumps that follow were output from
xxd, which uses a period (
.) in the (right-hand side) ASCII display to represent non-ASCII bytes. In the text encodings that don’t support two- or three-byte code points, I have replaced these with an asterisk (
2A) prior to writing/dumping. ASCII is one such encoding – it supports neither character. So, let’s take a look at our string, ‘* rights are * rights!’:
00000000: 2A 20 72 69 67 68 74 73 20 61 72 65 20 2A * rights are *
0000000e: 20 72 69 67 68 74 73 21 0A rights!.
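The asterisk substitution above can be reproduced with Python’s ASCII codec, though its replace error handler substitutes a question mark where I used an asterisk:

```python
s = "\u26a7 rights are \U0001F464 rights!"
# Characters outside ASCII's 7-bit repertoire get replaced with '?'
print(s.encode("ascii", errors="replace"))  # b'? rights are ? rights!'
```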
Presumably this is obvious, but ASCII has a very limited character repertoire. In reality a 7-bit encoding, ASCII at least had the very important role of being an early standardized encoding, which was great! Before ASCII, any given system’s text encoding was likely incompatible with any other’s. This kind of fell apart when localizations required larger character repertoires, and the eighth bit was used for any number of Extended ASCII encodings. Because ASCII and a number of Extended ASCII encodings standardized under ISO 8859 were so widely used, and are still so widely used, backward-compatibility remains important. In a very loose sense, Unicode could be seen as an extension onto ASCII – the first 128 code points (U+0000 through U+007F) are ASCII exactly. So: ASCII is limited to 7 bits, and the various Extended ASCIIs are limited to one byte. What does our byte stream look like if we open this up to two bytes per character?
00000000: 26 A7 00 20 00 72 00 69 00 67 00 68 00 74 &.. .r.i.g.h.t
0000000e: 00 73 00 20 00 61 00 72 00 65 00 20 00 2A .s. .a.r.e. .*
0000001c: 00 20 00 72 00 69 00 67 00 68 00 74 00 73 . .r.i.g.h.t.s
0000002a: 00 21 00 0A .!..
UCS-2 is about the most straightforward way to expand the character repertoire to 65,536 characters. Every single character is given two bytes, which means suddenly we can use our transgender symbol (
26 A7), and all of our ASCII symbols now essentially have a null byte in front of them (
00 72 for a lowercase
r). There are a lot of
00s in that stream.
xxd shows us an ampersand toward the beginning, since
26 is the ASCII code point for ‘&’, and
xxd throws up dots for all the null bytes. Unicode 11.0’s repertoire contains 137,439 characters, a number greater than 65,536. Our emoji, as mentioned, sits at code point
1F464, beyond the
FFFF supported by UCS-2 (and therefore replaced with an asterisk above). We can, however, encode the whole string with UCS-4:
00000000: 00 00 26 A7 00 00 00 20 00 00 00 72 00 00 ..&.... ...r..
0000000e: 00 69 00 00 00 67 00 00 00 68 00 00 00 74 .i...g...h...t
0000001c: 00 00 00 73 00 00 00 20 00 00 00 61 00 00 ...s... ...a..
0000002a: 00 72 00 00 00 65 00 00 00 20 00 01 F4 64 .r...e... ...d
00000038: 00 00 00 20 00 00 00 72 00 00 00 69 00 00 ... ...r...i..
00000046: 00 67 00 00 00 68 00 00 00 74 00 00 00 73 .g...h...t...s
00000054: 00 00 00 21 00 00 00 0A ...!....
The stream is now even more full of 00s, as every character gets four bytes. Our transgender symbol lives on as
00 00 26 A7, our ASCII characters have three null bytes (
00 00 00 72), and we can finally encode our emoji:
00 01 F4 64. You’ll see an errant
d in the ASCII column, that’s
xxd picking up on the
64 byte from the emoji. These two- and four-byte versions of the Universal Coded Character Set (UCS) are very straightforward, but not very efficient. If you think you might need to use characters above the
FFFF range, suddenly every character you type requires four bytes – if this was for the sake of a single character, your filesize could nearly double. It could nearly quadruple if the majority of your file was characters from ASCII. So the better way to handle this is with the Unicode Transformation Formats (UTFs).
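Before walking through the UTFs, the size trade-off is easy to quantify with Python’s built-in codecs (the -be variants are big-endian with no byte-order mark, matching the dumps in this post):

```python
s = "\u26a7 rights are \U0001F464 rights!\n"  # trailing newline, as dumped
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    print(codec, len(s.encode(codec)), "bytes")
# utf-8 28, utf-16-be 48, utf-32-be 92 -- the same totals as the hex dumps
```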
00000000: E2 9A A7 20 72 69 67 68 74 73 20 61 72 65 ... rights are
0000000e: 20 F0 9F 91 A4 20 72 69 67 68 74 73 21 0A .... rights!.
UTF-8 is essentially the standard text encoding these days. Both the World Wide Web Consortium and the Internet Mail Consortium recommend UTF-8 as the standard encoding. It starts with the 7-bit ASCII set, and starts setting high bits for multi-byte characters. In a multi-byte character, the first byte starts with binary 110, 1110, or 11110, depending on how many bytes follow (one, two, or three, respectively). The bytes that follow all begin with
10. Our transgender symbol requires three bytes:
E2 9A A7. The
A7 is familiar as the end of the codepoint,
26A7, but the first two bytes are not recognizable because of the above scheme.
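That leading-bit scheme can be sketched as a toy encoder (for illustration only; real code should just call str.encode('utf-8')):

```python
def utf8_bytes(cp: int) -> bytes:
    """Toy UTF-8 encoder, just to show the leading-bit scheme."""
    if cp < 0x80:                     # 0xxxxxxx -- plain ASCII
        return bytes([cp])
    if cp < 0x800:                    # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),  # 11110xxx + three continuation bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_bytes(0x26A7).hex())   # e29aa7
print(utf8_bytes(0x1F464).hex())  # f09f91a4
```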
If we break
26A7 into 4-bit binary words, we get…
2 6 A 7
0010 0110 1010 0111
…and breaking the UTF-8 bytes, E2 9A A7, the same way:
E 2 9 A A 7
1110 0010 1001 1010 1010 0111
E is our
1110 that signifies that the next two bytes are part of the same character. The next four bits are the beginning of our character, the
0010. The two following bytes are made up of two
10 bits, and six bits of code point information, so effectively our
26A7 is actually broken up like…
2 6/A… …A/7
0010 011010 100111
…and we see that in reality, it was mere coincidence that our three-byte version ended in A7. The 7 is a given, but the A happened by chance. UTF-8 is a great format as far as being mindful of size is concerned, but it’s less than ideal for a user who needs to examine a document at the byte level. While code point
26A7 will always translate to
E29AA7, a whole second mapping is needed, and the variable byte size per character means that a hex editor’s word size can’t be set to correspond directly to a character. At least it’s fairly easy to suss out at the binary level. UTF-16 looks like:
00000000: 26 A7 00 20 00 72 00 69 00 67 00 68 00 74 &.. .r.i.g.h.t
0000000e: 00 73 00 20 00 61 00 72 00 65 00 20 D8 3D .s. .a.r.e. .=
0000001c: DC 64 00 20 00 72 00 69 00 67 00 68 00 74 .d. .r.i.g.h.t
0000002a: 00 73 00 21 00 0A .s.!..
UTF-16 is used internally at the OS level a lot, and fortunately doesn’t really make its way to end-users much. We can see that our transgender symbol,
26 A7, comes out unscathed since it takes only two bytes. Our emoji shows up as
D8 3D DC 64, and the way we get there is very convoluted. First, UTF-16 asks that we subtract (hex)
10000 from our code point, giving us
F464. We pad this so that it’s twenty bits long, and break it into two ten-bit words. We then add hex
D800 to the first and
DC00 to the second:
Original: F4 64
Ten-bit: 0000111101 0001100100
Hex: 003D 0064
Plus: D800 DC00
Equals: D83D DC64
This has the same human-readability issues as UTF-8, and wastes a lot of bytes in the process. Next up would be UTF-32, but seeing as that puts us in four-bytes-per-character territory… It is functionally identical to UCS-4 above.
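The surrogate arithmetic above is mechanical enough to sketch in a few lines of Python (a toy, with names of my own choosing):

```python
def surrogate_pair(cp: int) -> tuple:
    """UTF-16 surrogate computation for code points above FFFF."""
    v = cp - 0x10000              # leaves a 20-bit value
    return (0xD800 + (v >> 10),   # high surrogate: top ten bits
            0xDC00 + (v & 0x3FF)) # low surrogate: bottom ten bits

high, low = surrogate_pair(0x1F464)
print(hex(high), hex(low))  # 0xd83d 0xdc64
```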
All of this information is readily available elsewhere, notably in Chapter 2, Section 5 of The Unicode Standard. Still, I haven’t seen a great side-by-side comparison of UCS and UTF formats at the byte level before, with a focus on how binary data lines up with Unicode code points. UTF-8 is the ‘gold standard’ for good reason – it allows the entire character repertoire to be represented while requiring the least amount of data. However, there are times when it’s necessary to examine text at the binary level, and for a human, this is much more easily accomplished by reëncoding the text as UCS-4/UTF-32 and setting a 32-bit word size in your hex editor.
If you’ve made it this far into a post about turning numbers into letters, I have one more thing to say… Please get out and vote, eligible American citizens. Our civil liberties are at stake.
Apple recently stirred up a bit of controversy when they revealed that their bagel emoji lacked cream cheese. Which is a ridiculous thing to get salty over, but ultimately they relented and added cream cheese to their bagel. Which should be the end of this post, and then I should delete this post, because none of that matters. But it isn’t the end, because I saw a lot of comments pop up following the redesign that reminded me: people really don’t seem to get how emoji work. Specifically, I saw a lot of things like ‘Apple can fix the bagel, but we still don’t have a trans flag’ or ‘Great to see Apple put cream cheese on the bagel, now let’s get more disability emoji’. Both of those things would, in fact, be great, but they have nothing to do with Apple’s bagel suddenly becoming more edible.
Unicode is, in its own words, “a single universal character encoding [with] extensive descriptions, and a vast amount of data about how characters function.” It maps out characters to code points, and allows me to look up the division sign on a table, find that its code point is
00F7, and insert this into my document: ÷. Transformation formats take on the job of mapping raw bytes into these standardized code points – this blog is written and rendered in the transformation format UTF-8. Emoji are not pictures sent back and forth any more than the letter ‘A’ or the division sign are – they are Unicode code points also, rendered out in a font like any other character. This is why if I go ahead and insert
1F9E5 (🧥), the resulting coat will be wildly different depending upon what system you’re on. If I didn’t specify a primary font for my site, the overall look of this place would be different for different users also, as the browser/OS would have its own idea of a default serif font.
This mapping, these code points, they are defined by The Unicode Consortium. The Consortium takes in proposals, makes decisions on character proposals as well as technical matters, makes drafts, does all sorts of behind-the-scenes junk, and spits out the Unicode Standard. Major revisions to the Unicode Standard then become an ISO Standard (10646 Information Technology — Universal Coded Character Set (UCS)). And while Apple is a voting (full) member of the Consortium, adding new characters (even emoji) is a serious process, much different from having a graphic designer slap some paint on a doughy circle.
“How characters function” is an important aspect of emoji as well. Much like I can use the combining diaeresis,
0308 ( ̈), with an ‘a’ to make ‘ä’, combinations of glyphs work to bring skin tones and gender markers to emoji. So, again, when I saw people (in jest, I truly hope) suggesting that Apple allow users to choose their bagel topping much like they would skin tone… well, it’s not a very effective joke when that too is a function of the Unicode Consortium.
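Python’s unicodedata module (my example, not the Consortium’s) shows the combining behavior directly:

```python
import unicodedata

combined = "a\u0308"   # 'a' followed by the combining diaeresis: two code points
print(combined)        # renders as a single 'ä'
# NFC normalization composes the pair into the single code point U+00E4
print(unicodedata.normalize("NFC", combined) == "\u00e4")  # True
```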
Philadelphia Cream Cheese ran a handful of Twitter ads over the whole controversy, and now that the dust has settled, they ran one thanking Apple and the Unicode Consortium, which… is largely wrong in the other direction, since the glyph itself is entirely on Apple. Part of the Unicode Standard, however, is multilingual descriptive text for characters. Emoji are annotated under the CLDR Character Annotations, given a short name as well as comments that may offer other explanations. So,
1F404, COW, is helpfully also described as potentially representing ‘beef (on menus)’, and (likely against the wishes of some members)
1F4A9, PILE OF POO, ‘may be depicted with or without a friendly face’. The notes on bagel do, in fact, suggest that it can represent ‘schmear’, so perhaps in some way Unicode was to thank by subtly suggesting that bagels are canonically coated.
All of this to say that, while Apple and Google are both (among others, of course) high-level members of the Unicode Consortium, it is just that – a consortium of contributors that go through an involved process to create a functional international standard mapping of characters from A-Z to hieroglyphics to the vegetable pictograms we pepper our sexts with. Changing the visual representation of a character in the emoji font is a much less daunting task than changing an ISO standard. Which is why shouting at Apple on Twitter is unlikely to get a trans flag emoji introduced, but submitting a proposal to the Unicode Consortium just might.
A while back, floppy disk enthusiast/archivist @foone posted about a floppy find, the Alice JPEG Image Compression Software. I suggest reading the relevant posts about the floppy, but the gist is that @foone archived and examined the disk and was left with a bunch of mysterious .CMP files which appeared to have JPEG streams but did not actually function as JPEGs. Rather, they would load but only displayed an odd little placeholder, identical for each file. I know a bit about JPEGs, and decided to try my hand at cracking this nut. The images that resulted were not particularly interesting – this was JPEG compression software from the early ’90s, clearly targeted at industries that would be storing a lot of images and not home users. The trick to the files, however, was a fun discovery.
The title of this post gives it away, I realize – the real images were effectively ‘commented out’. Here’s a hex dump of the relevant chunk of one of the photos:
(offset) (hex) (ascii)
00000270 F4 F5 F6 F7 F8 F9 FA FF C0 00 11 08 00 3C 00 50 .............<.P
00000280 03 01 21 00 02 11 01 03 11 01 FF DA 00 0C 03 01 ..!.............
00000290 00 02 11 03 11 00 3F 00 FF FE 00 0E 49 4D 41 47 ......?.....IMAG
000002A0 45 20 44 41 54 41 3D 3E C6 B1 3F E8 51 7F BB 52 E DATA=>..?.Q..R
000002B0 13 5E 64 BE 26 7D 65 1F E1 C7 D1 08 EA DB 73 B4 .^d.&}e.......s.
Something sure looks suspicious in that ASCII column, doesn’t it? Let’s talk briefly about JPEG files. JPEGs contain a number of different sorts of data: EXIF/metadata, the Huffman and quantization tables used to compress the image, information about the details of the image (bit depth, dimensions), and the image data itself, to name a few. All of this information is split up into chunks prefixed with a two-byte code:
FF followed by another byte that says what the data that follows is. At offset 277, we see
FF C0. This is the start of frame, and the next seventeen bytes tell us (among other things) that it’s an 8-bit/channel color image, 80x60 pixels. At offset 28A, we run into
FF DA, which is the start of the image itself. This only runs for 11 bytes, until we hit
FF FE at offset 298. Those 11 bytes are the odd little placeholder image from above, and
FF FE is, as you can probably guess, a comment.
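A naive marker scan over the chunk above, here in Python, turns up exactly the offsets just cited (skipping stuffed FF 00 pairs and FF FF runs, which we’ll get to shortly):

```python
# The chunk from the dump above, reconstructed as bytes (file offset 0x270 on).
chunk = bytes.fromhex(
    "F4F5F6F7F8F9FAFFC0001108003C0050"
    "03012100021101031101FFDA000C0301"
    "0002110311003F00FFFE000E494D4147"
    "4520444154413D3EC6B13FE8517FBB52"
    "135E64BE267D651FE1C7D108EADB73B4"
)

def list_markers(data, base=0x270):
    """Naive scan for FF xx markers -- enough for this chunk,
    nowhere near a full JPEG parser."""
    return [(base + i, data[i + 1])
            for i in range(len(data) - 1)
            if data[i] == 0xFF and data[i + 1] not in (0x00, 0xFF)]

for offset, marker in list_markers(chunk):
    print(f"{offset:x}: FF {marker:02X}")
# 277: FF C0 (start of frame), 28a: FF DA (start of scan), 298: FF FE (comment)
```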
Comments aren’t that prevalent in JPEGs. JFIF, EXIF, and XMP data are all stored in application-specific data chunks (much like the layer information in Adobe Fireworks PNGs). Comments are typically used to mark what encoder produced the JPEG, and that’s about it. But, much like using comments to soft-delete code, an entire image can be stuffed in there, waiting for a specific decoder (or hex editor user) to erase the placeholder image and the comment prefix. Presumably this is just what the Alice software did: it would find
FF DA, and ignore everything until after
FF FE 00 0E
IMAGE DATA=>. Other decoders would simply ignore the real image, because that’s what the JPEG spec tells them to do.
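A recovery script along those lines might look like this sketch (the function name and in-memory splice are my own; I’m only assuming what the posts describe, a single comment right after the scan header):

```python
def uncomment_jpeg(data: bytes) -> bytes:
    """Splice out the placeholder scan and the comment prefix, so the
    'commented-out' image data becomes the real scan."""
    sos = data.index(b"\xff\xda")                      # start-of-scan marker
    header_len = int.from_bytes(data[sos + 2:sos + 4], "big")
    scan_start = sos + 2 + header_len                  # first entropy-coded byte
    com = data.index(b"\xff\xfe", scan_start)          # the comment marker
    com_len = int.from_bytes(data[com + 2:com + 4], "big")
    # keep everything through the scan header, drop placeholder + comment,
    # keep the hidden image data onward
    return data[:scan_start] + data[com + 2 + com_len:]
```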
Not seen above, but necessary for the process to work and interesting to consider is the sequence
FF 00. All markers, including comments, are only terminated upon encountering another
FF byte. Once you get to the compressed image data, you’re likely going to need
FF bytes that aren’t instructions to the decoder. These are essentially escaped by the two-byte sequence
FF 00. The decoder knows that this is not the start of a new chunk, but rather a literal
FF. This works across the board – which means that our commented-out image can (and does) contain several
FF 00 sequences, and the decoder does not interpret this as termination of the comment.
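Un-stuffing is then a single substitution at the byte level – a toy illustration, not how a streaming decoder would literally do it:

```python
# A stuffed FF inside scan data: the two-byte FF 00 stands for a literal FF.
scan = b"\x12\xff\x00\x34"
print(scan.replace(b"\xff\x00", b"\xff"))  # b'\x12\xff4'
```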
Finally, it’s worth noting that JPEG images end with
FF D9, which are (expectedly) the last two bytes in any of the given .CMP files. The placeholder image doesn’t need its own
FF D9, since the one at the end of the file is the next marker that’s encountered after the comment regardless. In fact, doing so likely would have required additional logic in the Alice placeholder-removal scheme, as you would now have to ignore the end-of-image marker under (exactly) one specific condition on top of everything else.
This is obviously not a robust form of copyright protection, and seemingly lends itself to an inefficient set of Huffman and quantization tables as well. These inefficiencies could likely be handled better by modern encoders designed around needing tables for two images, and it is interesting to think of potential use-cases. One could, theoretically, comment out encrypted image data, while leaving a placeholder image that tells a user as much. Practical? Likely not, but as much as we code nerds take our ability to comment out code for granted, it’s rather fascinating to see the same techniques played out in the binary sphere.
A while back, I wrote lovingly of a sweet little tabletop RPG (TTRPG) called Mirror. Currently, I am in the middle of a campaign of an upcoming (to Kickstarter, October 1) TTRPG by the same author (a personal friend, it’s worth noting), entitled Americana. I have no real desire to discuss the nitty-gritty mechanics of, say, where the dice go and how to use them, but as far as my experience is concerned this all works well. I don’t mean to be dismissive of the gears that make the clock tick – all the little details are incredibly important and difficult to make work. I just don’t think that writing about them is particularly expressive, and Americana has a lot of implementation facets that really make for a compelling experience. These experiential details are what I’d prefer to discuss.
Americana is a murder-mystery. The ultimate goal is to solve the murder of the group’s mutual friend. Thus, while there will be all manner of conflict, the primary driving factor for success is that of deduction. In its default state, it’s a game of deduction defined by play – nobody, not even the game master (GM), knows who the murderer is. For a potentially-relatable board game comparison, it’s less Clue, and more Tobago. By the same token, the GM doesn’t come to the first session with a fully-developed setting. Instead, the first session is essentially a collaboration between the GM and players to make decisions on everything about the town the characters reside in. Its size, its local hangouts and cliques, the adults who will yell at you for smoking in the alley. It’s an approach that gives the players agency, and therefore makes them more invested in the narrative. This is, perhaps, less important in a lot of TTRPG settings; you don’t necessarily need to be invested in the dungeon you’re slashing your way through, you just need a compelling hook (of course I want to slay Richard the cockatrice and avenge the death of my sister, the statue!). But, for a group of teens to band together in the confines of a single town to sneak around and solve their mutual friend’s murder? The additional level of engagement makes the whole thing more personal.
Character generation is, of course, also done during the first session. To add one more collaborative element to setup, however, the group collectively creates an additional character: the dead friend. Your own personal Laura Palmer gets assigned skills just like any other character, which any player character can make use of throughout the game via a clever flashback mechanism. Largely, however, the dead friend’s sheet is whitespace, to be filled with a Charlie-Kelly’s-mailroom-esque web of characters, places, objects, and their ties to the victim. During setup, it’s another place where players get to create their own investment into the campaign. For the rest of the game, it’s the framework of deduction.
Engaging experiential details like those compel me to step into the world, but there are a handful of less hands-on details that I also feel the need to bring up. For starters, the standard setting of Americana is that sort of Happy Days-esque idealized 1950s America: cool cars, malt shops, and teenage hijinks. While there’s a lot to be said for this aesthetic, (privileged) creators have a tendency to ignore the fact that it was a pretty tough time if you weren’t a straight white cis male. It is a very welcome touch, then, that Americana explicitly says that this is not our 1950s America, but one which developed without the horrible marginalization that still informs our 2018 reality. Of course, the world of Americana has elves and bipedal dragons as well, so it’s not a huge leap to say ‘hey, this world ostensibly resembles an existing world, but is canonically better.’ It’s a simple thing that many creators seem unwilling to do, but it has huge implications as far as diversifying the hobby. I’m lucky enough to game with a bunch of folks who I trust implicitly, but if I were joining a random game night or a session at a con, I’d be a lot less likely to express myself in ways that are true to who I am. Even more so in a game setting that, if taken at face value, gives players a tidy excuse for harmful behavior.
This conscious, proactive approach to player safety and comfort is extended to the No Card, a bit of paper with a big ol’ “X” on it that every player has on hand. If anything currently being played out is making a player uncomfortable, they can reveal their card and halt the narrative. Theoretically, of course, any player in any game can speak up and announce that they’re not okay with the direction a narrative is going, but in reality that takes a lot more emotional energy than flipping a card – especially when you factor in marginalized folks tending to be taught not to speak over their oppressors. I’ve seen this sort of thing mentioned before as a mechanism to ensure an inclusive environment for any game. Being baked right into the rules immediately sets a positive tone, though, and I think little inclusions like this in rulebooks really have the potential to make the hobby more welcoming.
I don’t necessarily play roleplaying games to win. Winning is (hopefully) very satisfying, of course, but the overall experience is what makes hours of gameplay over weeks of sessions feel like no time at all. Experience encompasses many things, and it comes as much from the GM and players as it does the rules. But a good rulebook lends itself to a good experience, and from ensuring that all the players are engaged and invested in the narrative to establishing basic safety nets and boundaries for players, Americana lays the groundwork for a great experience.
When I first wrote the ‘Solo play’ series, the entries were basically the top five solo board/card games that I was playing at the time, in order of preference. Adding to the series at this point just means adding more solo games that I love; the order isn’t particularly meaningful anymore.
Beyond nostalgia, I’ve enjoyed a lot of the modern takes on the Choose Your Own… errm… Narrative style of book. Recently, my fellow commuters and I have been laughing and stumbling our way through Ryan North’s 400-page Romeo and/or Juliet, which I highly recommend. There are great independent works up at chooseyourstory.com. It’s an art form that’s alive and well, and has grown beyond the exclusive realm of children. Does a book that you read out of order, and often fail to bring to a happy conclusion count as a game? Does it warrant a post in my ‘Solo play’ series?
Cardventures: Stowaway 52 by Gamewright is a card-based version of the choosable narrative. The premise is something along the lines of being stuck on an alien ship set to destroy Earth. The assumption is that you like Earth, and would therefore like to keep this plan from happening. My initial suspicion was that the thing should’ve just been a book, and that the card-based system was a cost-cutting measure or a gimmick. I was pleasantly surprised to find that I was quite wrong about this.
The card system has a few implications, two of which make it stand out for the better. First, the game instructs you to start with any card in the deck (of 52, in case the name wasn’t telling enough). This is a little bit gimmicky, but it’s oddly charming as well. You pick a card at random, and go from there. The narrative is largely about moving around the ship, and so the cards are almost all just locations, except for those which make for the second neat thing about the card-based approach: items. Ordinarily, as you choose your next node, you discard the current card so you can measure your progress later. But some of the cards are items, which you set aside for further use – some paths can’t be chosen unless you’ve already acquired the necessary item. This could be done in a book, using counters or a notepad or something, but I think it would be very clunky compared to the cards. This is a very clever mechanic that brings the experience slightly closer to a Zork.
Those rather innovative aspects do have some drawbacks. Because there’s no defined beginning, there can’t really be a defined end either. Instead, you go through until you can’t go anywhere (cards always have two choices, and you can’t revisit a card). Every card has a point value, and when you have no more choices, you count up all the points from the cards you made it through, and read one of four endings accordingly. They are… not terribly satisfying, because the game has no real sense of what narrative decisions you took, only that you made it pretty far or not far at all. Likewise, as mentioned, the cards themselves are by necessity basically just locations. This is more satisfying than the issue with the endings; you do get the sense that you’re frantically sneaking about, trying to avoid aliens. But it still lacks the depth that a slightly more linear system would afford.
Cardventures: Stowaway 52 is a novel approach to the choosable narrative concept. Gamewright apparently has a second entry in the series, Jump Ship, which I look forward to trying at some point. In my first run-through of Stowaway 52, I managed to get over 200 points (the maximum is 300, and hitting this is the only way to get the winning ending). Even though the narrative itself was kind of thin, moving through all the bits of the ship and grabbing items was pretty satisfying. Reaching a node where I needed one of the items I had was very satisfying. Enough so that I think it deserves a write-up in my Solo Play series, apparently.