Binaries and hex editors

Talking about certain files as ‘binaries’ is a funny thing. All files are ultimately binary, after all; it’s just a matter of whether or not a file is encoded as text. Even in the world of text, an editor or viewer needs to know how the text is encoded: which bytes map to which characters. Is a file ASCII, UTF-8, PostScript? Once we know whether something is text or not, it’s still likely to follow the standards of a specific format, unless it’s nothing but plain text. Markdown, HTML, even PDF[1] are human-readable text to an extent, with rules about how their content is interpreted. A human as well as a web browser knows that a <p> starts a paragraph, and that this paragraph continues until a matching </p> is found.
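
To make the encoding point concrete, here is a minimal Python sketch. The five sample bytes are made up for illustration and don't come from any particular file; the point is only that the same bytes read differently depending on which encoding we assume.

```python
# Five bytes chosen purely for illustration.
raw = bytes([0x63, 0x61, 0x66, 0xC3, 0xA9])

as_utf8 = raw.decode("utf-8")      # C3 A9 together form one character
as_latin1 = raw.decode("latin-1")  # every byte is its own character

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # 0xC3 and 0xA9 fall outside ASCII's 7-bit range.
    ascii_ok = False

print(as_utf8)    # café
print(as_latin1)  # cafÃ©
print(ascii_ok)   # False
```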

If we open a binary in a text editor, we’ll see a lot of familiar characters, where data happens to coincide with printable ASCII. We’ll also see a lot of gibberish, and in fact some of the characters may cause a terminal to behave erratically. Opening a binary in a hex editor makes a little more sense of it, but still leaves a lot to be answered. In one column, we’ll see a lot of hexadecimal values; in another we’ll see the same sort of gibberish we would have seen in our text editor. In some sort of status display, we’ll also generally see a few more bits of information – what byte we’re on, its hex value, its decimal value, etc. Why would we ever want to do this? Well, among other things, binary file formats have rules as well, and if we know these rules, we can inspect and navigate them much like an HTML file. Take this piece of a PNG file, as it would appear in bvi (my hex editor of choice).

00000000  89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52 .PNG........IHDR
00000010  00 00 02 44 00 00 01 04 08 06 00 00 00 C9 50 2B ...D..........P+
00000020  AB 00 00 00 04 73 42 49 54 08 08 08 08 7C 08 64 .....sBIT....|.d
00000030  88 00 00 00 09 70 48 59 73 00 00 0B 12 00 00 0B .....pHYs.......
00000040  12 01 D2 DD 7E FC 00 00 00 1C 74 45 58 74 53 6F ....~.....tEXtSo
"ban_ln_560_NLW.png" 14498451 bytes    00000000 10001001 \211 0x89 137 NUL

A brief overview of what we’re looking at here: five rows of data in three columns, plus a sixth status row. The three columns in the data rows are the offset of the byte the row starts at (think line number), the data itself in hexadecimal, and then the exact same data represented as best as possible in ASCII, with a dot standing in for anything unprintable. Binary data often contains ASCII data; here, for example, we can pick out ‘PNG’, ‘IHDR’, ‘sBIT’, ‘pHYs’, and ‘tEXt’. The status row shows, from left to right, the file name, the file size in bytes, the offset (in hex) of the byte the cursor is on, and then the current byte in binary, in octal, in hexadecimal, in decimal, and finally as a friendlier ASCII-style label (NUL, CR, 's').
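
For the curious, here is a rough Python sketch of how rows like these could be produced. It approximates the layout shown above and is not bvi's actual code; the function names hexdump_row and status_fields are invented for this example, and the sample bytes are the first two rows of the dump.

```python
# First 32 bytes of the file, copied from the dump above.
SAMPLE = bytes.fromhex(
    "89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52"
    "00 00 02 44 00 00 01 04 08 06 00 00 00 C9 50 2B"
)

def hexdump_row(data, offset):
    """Format 16 bytes as: offset, hex column, ASCII column."""
    row = data[offset:offset + 16]
    hex_part = " ".join(f"{b:02X}" for b in row)
    # Printable ASCII is shown as-is; everything else becomes a dot.
    ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
    return f"{offset:08X}  {hex_part} {ascii_part}"

def status_fields(data, offset):
    """Offset, then the byte in binary, octal, hex and decimal."""
    b = data[offset]
    return f"{offset:08X} {b:08b} \\{b:03o} 0x{b:02X} {b:3d}"

print(hexdump_row(SAMPLE, 0x00))
print(hexdump_row(SAMPLE, 0x10))
print(status_fields(SAMPLE, 0x00))
```

The ASCII label at the end of bvi's status row is left out here, since it would need a lookup table of control-character names.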

Now, what is the data telling us? Well, PNG doesn’t have many rules (at least, not many that concern us). A file starts with an 8-byte string announcing that it’s a PNG. From that point forward, every piece of the file lives in a chunk. A chunk contains 4 bytes announcing the length of its data, a 4-byte chunk type descriptor, the data itself, and a 4-byte checksum (essentially). We can easily go from chunk to chunk by reading a chunk’s length, moving 4 bytes forward over the descriptor, moving forward the number of bytes that the length told us, and then moving forward four more bytes over the checksum. This places us at the start of the next chunk’s 4-byte length. Demonstrated (the cursor’s offset is the first hex field in the status row), starting eight bytes in at our first chunk (l goes right one character; bvi roughly follows vi/vim conventions):

00000000  89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52 .PNG........IHDR
"ban_ln_560_NLW.png" 14498451 bytes    00000008 00000000 \000 0x00   0 NUL

lll

00000000  89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52 .PNG........IHDR
"ban_ln_560_NLW.png" 14498451 bytes    0000000B 00001101 \015 0x0D  13  CR

This tells us our first chunk’s length is 0000000D, or 13 bytes[2]. The cursor is still sitting on the last byte of the length, so skipping ahead 5 bytes (let’s use a shortcut: 5l) steps off the length and past our chunk descriptor, IHDR, to the beginning of the data:

00000010  00 00 02 44 00 00 01 04 08 06 00 00 00 C9 50 2B ...D..........P+
"ban_ln_560_NLW.png" 14498451 bytes    00000010 00000000 \000 0x00   0 NUL

The descriptors are always strings of 4 ASCII characters[3], so they’re easy to pick out. This means that if we add 13 (data) and 4 (checksum) to our current location (00000010, or 16 in decimal), we land on byte 33, the start of the next chunk’s length, and so should be 4 bytes before something that looks like a descriptor (4 bytes of ASCII).
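
The same arithmetic, sketched in Python under the assumption that we already have the first chunk’s bytes in hand (they are copied from the dump; the variable names are mine). As a bonus, the 13 data bytes are unpacked using the IHDR layout described in note 2.

```python
import struct

# Bytes 0x08 through 0x20 of the dump: one whole chunk.
first_chunk = bytes.fromhex(
    "00 00 00 0D"                             # length
    "49 48 44 52"                             # descriptor
    "00 00 02 44 00 00 01 04 08 06 00 00 00"  # data
    "C9 50 2B AB"                             # checksum
)

chunk_start = 0x08  # right after the 8-byte PNG signature

# Lengths are stored big-endian, most significant byte first.
(length,) = struct.unpack(">I", first_chunk[0:4])
descriptor = first_chunk[4:8].decode("ascii")

data_start = chunk_start + 4 + 4      # skip length and descriptor
next_chunk = data_start + length + 4  # skip data and checksum

# IHDR: two 4-byte integers followed by five single bytes.
(width, height, bit_depth, color_type,
 compression, filter_method, interlace) = struct.unpack(
    ">IIBBBBB", first_chunk[8:8 + length]
)

print(descriptor, length)        # IHDR 13
print(data_start, next_chunk)    # 16 33
print(width, height, bit_depth)  # 580 260 8
```

Back in bvi, the same jump is a single command: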

:33return (: jumps to a byte, in decimal, or hexadecimal with a leading zero)

00000010  00 00 02 44 00 00 01 04 08 06 00 00 00 C9 50 2B ...D..........P+
00000020  AB 00 00 00 04 73 42 49 54 08 08 08 08 7C 08 64 .....sBIT....|.d
:33                                    00000021 00000000 \000 0x00   0 NUL

tab (this moves between the hex & ASCII columns)

00000010  00 00 02 44 00 00 01 04 08 06 00 00 00 C9 50 2B ...D..........P+
00000020  AB 00 00 00 04 73 42 49 54 08 08 08 08 7C 08 64 .....sBIT....|.d
:33                                    00000021 00000000 \000 0x00   0 NUL

4l

00000010  00 00 02 44 00 00 01 04 08 06 00 00 00 C9 50 2B ...D..........P+
00000020  AB 00 00 00 04 73 42 49 54 08 08 08 08 7C 08 64 .....sBIT....|.d
:33                                    00000025 01110011 \163 0x73 115 's'

…and so on – just as we can find where a <p> element ends by searching for its </p>, we can navigate a PNG chunk by chunk by figuring out a chunk’s size, and then jumping past that chunk to the next one[4]. This may all seem pretty pointless; after all, you just end up with a bunch of additional binary data in the end. But this level of deep inspection often comes in handy. Sure, a purpose-built tool (pngcheck) can probably do a better job of tracking down the data you’re looking for. But often, knowledge of a file format combined with a tool to dive into it can shine a very bright light on a very real problem.
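
The whole chunk-hopping routine fits in a few lines of Python. This is a minimal sketch, not a validator: it ignores the checksums entirely, the name walk_chunks is invented for this example, and the sample data is just the 80 bytes visible in the first dump, so the walk stops where the dump does.

```python
import struct

PNG_SIGNATURE = bytes.fromhex("89 50 4E 47 0D 0A 1A 0A")

# The 80 bytes shown in the first dump above.
SAMPLE = bytes.fromhex(
    "89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52"
    "00 00 02 44 00 00 01 04 08 06 00 00 00 C9 50 2B"
    "AB 00 00 00 04 73 42 49 54 08 08 08 08 7C 08 64"
    "88 00 00 00 09 70 48 59 73 00 00 0B 12 00 00 0B"
    "12 01 D2 DD 7E FC 00 00 00 1C 74 45 58 74 53 6F"
)

def walk_chunks(data):
    """Yield (offset, length, descriptor) for each chunk header in data."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG: bad signature")
    offset = 8
    # Stop once there is no room left for a length plus a descriptor.
    while offset + 8 <= len(data):
        (length,) = struct.unpack(">I", data[offset:offset + 4])
        descriptor = data[offset + 4:offset + 8].decode("ascii")
        yield offset, length, descriptor
        # Hop over length, descriptor, data and checksum.
        offset += 4 + 4 + length + 4

for offset, length, descriptor in walk_chunks(SAMPLE):
    print(f"{offset:08X}  {descriptor}  {length} bytes")
```

To run it against a real file, replace SAMPLE with the result of open(path, "rb").read(); reading the whole file at once is fine for a sketch, though a large file would be better served by seeking.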


  1. PDF is an extreme example, but technically a person could very well sit down and write a PDF file from scratch in a text editor, and while the format’s standard is thousands of pages long and nearly impossible to grok, a human could theoretically read it as well. The (or at least, one) issue is that much of the underlying content will be compressed and therefore once again a binary. ↩︎
  2. IHDR is always the first chunk of a PNG, and it’s always 13 bytes: 4 bytes of width, 4 bytes of height, and 1 byte each of bit depth, color type, compression method, filter method, and interlace method. ↩︎
  3. While the descriptors can be anything (so long as they don’t step on any other descriptors’ toes), brilliantly, they also carry 4 bits of useful information, one in each character’s case. In order: critical/ancillary, public/private, (reserved for future use), and unsafe/safe to copy. ↩︎
  4. In reality a lot of the data in a PNG would be flate-compressed, but still. Even that could be extracted (dd) and then inflated. ↩︎
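
The case trick from note 3 can be sketched in Python as well. In ASCII, a lowercase letter differs from its uppercase twin only in bit 5 (0x20), so each descriptor character carries one flag. The function name chunk_properties and the dictionary keys are invented for this example.

```python
def chunk_properties(descriptor):
    """Read the four case bits of a 4-character chunk descriptor."""
    # Bit 5 (0x20) is set for lowercase ASCII letters, clear for uppercase.
    flags = [bool(ord(ch) & 0x20) for ch in descriptor]
    return {
        "ancillary": flags[0],     # uppercase means critical
        "private": flags[1],       # uppercase means public
        "reserved": flags[2],      # reserved for future use
        "safe_to_copy": flags[3],  # uppercase means unsafe to copy
    }

for name in ("IHDR", "sBIT", "pHYs", "tEXt"):
    print(name, chunk_properties(name))
```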