Extracting JPEGs from PDFs

bri hefele, 2017-04-18, cli, tips and tricks, unix

I’m not really making a series of ‘things your hex editor is good for’, I swear, but one more use case that comes up frequently enough in my day-to-day life is extracting JPEGs from PDF files. This can be scripted simply enough, but I find doing these things manually from time to time to be a valuable learning experience.

PDF is a heck of a file format, but we really only need to know a few things right now. PDFs are made up of objects, and some of these objects (JPEGs included) are stream objects. Stream objects always have some relevant data stored in a thing called a dictionary, and this includes two bits of data we need to get our JPEG: the Filter tells the viewer how to interpret the stream, and the Length tells us how long, in bytes, the data is. The filter for JPEGs is ‘DCTDecode’, so we can open up a PDF in a hex editor (I’ll be using bvi again) and search for this string to find a JPEG. Before we do, one final thing we should know is that the stream data begins immediately after the End Of Line (EOL) marker that follows the word ‘stream’. That EOL is usually two bytes, 0D 0A or CR LF, but the PDF specification also allows a lone 0A (LF) there, so it’s worth checking which one your file actually uses.
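To make the dictionary idea a bit more concrete, here is a minimal Python sketch that pulls the Filter and Length out of the raw bytes of a stream dictionary. The function name `stream_info` and the sample dictionaries are made up for illustration, and it only handles the simple case: a single filter name and a Length written as a plain number. PDFs may also store Length as an indirect reference (something like `12 0 R`), in which case the sketch gives up and returns None for the length.

```python
import re


def stream_info(dictionary):
    """Return (filter_name, length) from the raw bytes of a stream dictionary.

    Hypothetical helper: assumes one filter name and, ideally, a direct
    integer Length. Either value is None when it cannot be read this way.
    """
    filt = re.search(rb"/Filter\s*/(\w+)", dictionary)
    # The optional second group catches an indirect reference like "12 0 R".
    length = re.search(rb"/Length\s+(\d+)(\s+\d+\s+R)?", dictionary)

    name = filt.group(1).decode("ascii") if filt else None
    if length is None or length.group(2) is not None:
        return name, None  # Length missing, or stored in another object
    return name, int(length.group(1))


# Hand-typed stand-in dictionaries, not taken from a real file
print(stream_info(b"<</Filter/DCTDecode/Height 119/Length 5533/Width 121>>"))
print(stream_info(b"<</Filter/DCTDecode/Length 12 0 R>>"))
```

A regex over raw bytes is nowhere near a real PDF parser, but it matches what we are about to do by eye in the hex editor.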

/DCTDecode [Enter]

00002E80  6C 74 65 72 2F 44 43 54 44 65 63 6F 64 65 2F 48 lter/DCTDecode/H
00002E90  65 69 67 68 74 20 31 31 39 2F 4C 65 6E 67 74 68 eight 119/Length
00002EA0  20 35 35 33 33 2F 4E 61 6D 65 2F 58 2F 53 75 62  5533/Name/X/Sub
00002EB0  74 79 70 65 2F 49 6D 61 67 65 2F 54 79 70 65 2F type/Image/Type/
00002EC0  58 4F 62 6A 65 63 74 2F 57 69 64 74 68 20 31 32 XObject/Width 12
00002ED0  31 3E 3E 73 74 72 65 61 6D 0D 0A FF D8 FF EE 00 1>>stream.......
/DCTDecode                                     00002E85  \104 0x44  68 'D'

This finds the next ‘DCTDecode’ stream object and puts us on that leading ‘D’, byte offset 2E85 (decimal 11909) in this instance. Glancing ahead a bit, we can see that the Length is 5533 bytes. If we then search for ‘stream’ (/stream [Enter]), we’ll be placed at byte offset 2ED3 (decimal 11987). The word ‘stream’ is 6 bytes, and we need to add an additional 2 bytes for the EOL (the 0D 0A visible in the dump). This means our JPEG data starts at byte offset 11995 (hex 2EDB, right where the FF D8 begins) and is 5533 bytes long.
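The same offset arithmetic can be sketched in Python. The function name `locate_jpeg_stream` is hypothetical, and the sample buffer is not a real PDF: it is zero padding followed by the dictionary text from the dump, arranged so the offsets line up with the ones above. The sketch assumes the first ‘stream’ after the filter name is the keyword we want, which holds here but is not guaranteed for every file.

```python
def locate_jpeg_stream(buf, search_from=0):
    """Return (filter_offset, keyword_offset, data_offset) for the next
    DCTDecode stream found in buf at or after search_from."""
    filter_at = buf.index(b"DCTDecode", search_from)
    # Assumption: the next 'stream' after the filter is this object's keyword.
    keyword_at = buf.index(b"stream", filter_at)
    after_keyword = keyword_at + len(b"stream")

    # The keyword is followed by CR LF (2 bytes) or a lone LF (1 byte).
    if buf.startswith(b"\r\n", after_keyword):
        eol_len = 2
    elif buf.startswith(b"\n", after_keyword):
        eol_len = 1
    else:
        raise ValueError("no EOL after the 'stream' keyword")
    return filter_at, keyword_at, after_keyword + eol_len


# Stand-in buffer: 0x2E85 zero bytes, then the bytes seen in the hex dump
sample = bytes(0x2E85) + (
    b"DCTDecode/Height 119/Length 5533/Name/X/Subtype/Image"
    b"/Type/XObject/Width 121>>stream\r\n\xff\xd8\xff\xee\x00"
)
filter_at, keyword_at, data_at = locate_jpeg_stream(sample)
print(hex(filter_at), filter_at)
print(hex(keyword_at), keyword_at)
print(hex(data_at), data_at)
```

On this stand-in buffer the three offsets come out as 11909, 11987 and 11995, matching what we worked out by hand.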

How, then, to extract this data? It may not be everyone’s favorite tool, but dd fits the bill perfectly. It allows us to input a file, start at a byte offset, read a given number of bytes, and output the resulting chunk of file: just what we want. Assuming our file is ‘test.pdf,’ we can output ‘test.jpg’ like…

dd bs=1 skip=11995 count=5533 if=test.pdf of=test.jpg

bs=1 sets our block size to 1 byte. This is important because skip and count are both measured in blocks, and dd is largely used for volume-level operations where blocks are larger. skip tells dd how many blocks (here, bytes) of input to skip, essentially the initial offset. count tells it how many blocks (again, bytes) to read. if and of are the input and output files respectively. dd doesn’t follow normal Unix flag conventions: there are no prefixing dashes, and those equal signs are quite atypical. dd is also quite powerful, so it’s always worth reading the manpage.
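For comparison, here is roughly what that dd invocation does, sketched in Python. `extract_bytes` is a made-up helper, not part of any library, and the demo runs against a throwaway file with a pretend JPEG in it rather than a real PDF. The only JPEG knowledge it leans on is that the data starts with the bytes FF D8, as seen in the dump.

```python
import os
import tempfile


def extract_bytes(src_path, dst_path, offset, length):
    """Copy `length` bytes starting at byte `offset` of src_path into dst_path.

    Rough equivalent of: dd bs=1 skip=OFFSET count=LENGTH if=SRC of=DST
    """
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read(length)
    if len(chunk) != length:
        raise ValueError(f"wanted {length} bytes, file only had {len(chunk)}")
    with open(dst_path, "wb") as dst:
        dst.write(chunk)
    return chunk


# Demo on a throwaway file: 100 filler bytes, a fake JPEG, then a trailer
fake_jpeg = b"\xff\xd8" + b"pretend image data" + b"\xff\xd9"
with tempfile.TemporaryDirectory() as tmp:
    pdf_path = os.path.join(tmp, "test.pdf")
    jpg_path = os.path.join(tmp, "test.jpg")
    with open(pdf_path, "wb") as f:
        f.write(b"x" * 100 + fake_jpeg + b"\r\nendstream")
    result = extract_bytes(pdf_path, jpg_path, 100, len(fake_jpeg))
    with open(jpg_path, "rb") as f:
        written = f.read()

# A quick sanity check: JPEG data should begin with FF D8
print(result[:2] == b"\xff\xd8")
```

For the file in this post, the equivalent call would be extract_bytes("test.pdf", "test.jpg", 11995, 5533).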