Recently I was tasked with ensuring a two-page document was §508 compliant, something that I do every day. I didn’t really expect any hang-ups; even the most complicated two-page PDF is still only two pages. I got through the first page with ease. Navigating via the Tags panel, as I do, I landed on a table in the second page. Acrobat immediately stopped responding. Frustrating, but Acrobat is not the stablest of software, so I didn’t think much of it. Back into the fray, Acrobat hangs up at the exact same spot. Third time around, I turn off highlighting, and am able to start unfolding the table…
<Table> <TR> <Table> <TR> <Table> <TR> <Table> <TR>
…and so on, until I gave up. I folded it back up and ctrl-clicked the disclosure triangle to automatically unfold the entire nest. Acrobat immediately stopped responding. I had a suspicion, and upon opening the file up yet again and changing the topmost <TR> to <Not a TR>, I unfolded…
<Table> <Not a TR> <Table> <Not a TR> <Table> <Not a TR> <Table> <Not a TR>
A self-referential nightmare
This was not a table with a lot of nested elements. It was, in fact, an infinite table, essentially a form of resource bomb[1]: whenever Acrobat has to do anything that requires traversing this nest of tags, it never stops. Fortunately, with highlighting off and a bit of luck, I was able to delete the tag from within Acrobat. What if this wasn’t the case?
I’ve mentioned before that PDFs are, ultimately, plaintext files that are at least somewhat human-readable. In practice, of course, they rarely arrive that way – exporters typically Flate- or LZW-compress every otherwise uncompressed stream in a given PDF. Fortunately, Acrobat can undo this, spitting out a thoroughly uncompressed PDF for us to examine. There are going to be some odd header bytes, and there’s likely other binary data (JPEGs, for instance), so I recommend doing this in a hex editor.
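If Acrobat is not available for the decompression step, a rough alternative is to inflate the Flate streams by hand. The following is only a minimal sketch and not the Acrobat route described above: it assumes an unencrypted file, assumes every stream sits between a literal stream/endstream pair, and ignores LZW and chained filters entirely. The name inflate_streams is made up for illustration.

```python
import re
import zlib

# Naive pattern: the "stream" keyword, an end-of-line, then everything up to
# the next "endstream". Real PDFs should be read via the /Length entry instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(pdf_bytes):
    """Return the decoded bytes of every stream that zlib can inflate."""
    decoded = []
    for match in STREAM_RE.finditer(pdf_bytes):
        # decompressobj() tolerates the trailing end-of-line before "endstream".
        inflater = zlib.decompressobj()
        try:
            decoded.append(inflater.decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (a JPEG, an LZW stream, ...): leave it alone.
            continue
    return decoded
```

Reading a file in binary mode and passing its bytes to inflate_streams gives back a list of decoded chunks that can be searched for obj and endobj; anything that fails to inflate is simply skipped.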
Let’s talk about objects
Before we actually look at the troublesome bit of the file in question, we need to be a little bit comfortable with the way objects are formatted. We’re dealing with indirect objects for the purpose of this post, enclosed within the delimiters x y obj␍ and endobj, where x is a unique ID and y is a generational identifier (likely 0). Within this, we care about a few things:
- x y R is a reference to the object with the aforementioned identifiers,
- K is a single object or array of objects that are children of the current object, and
- P is the parent object.
With that in mind, let’s look at our little troublemaker:
189 0 obj␍<</A 190 0 R/K[76 0 R 191 0 R 192 0 R 193 0 R 194 0 R 195 0 R 196 0 R 197 0 R 198 0 R 199 0 R 200 0 R 201 0 R 202 0 R 203 0 R 204 0 R 205 0 R 206 0 R 207 0 R 208 0 R 209 0 R 210 0 R 211 0 R 212 0 R 213 0 R 214 0 R 215 0 R 216 0 R 217 0 R 218 0 R 219 0 R]/P 191 0 R/S/Table>>␍endobj
191 0 obj␍<</K[189 0 R 189 0 R]/P 189 0 R/S/TR>>␍endobj
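To make the K and P relationships easier to see, the two objects can be pulled apart programmatically. This is a rough sketch under several assumptions: the PDF is already uncompressed, every object of interest is a flat dictionary, and anything carrying an /S name is treated as a structure element (which will also catch unrelated dictionaries that happen to use /S). The function name parse_struct_elems is hypothetical, and the K array in SAMPLE is shortened from the listing above.

```python
import re

# One indirect object: "<id> <gen> obj", a body, then "endobj".
OBJ_RE = re.compile(rb"\b(\d+) (\d+) obj\s*(.*?)\s*endobj", re.DOTALL)
TYPE_RE = re.compile(rb"/S\s*/(\w+)")
PARENT_RE = re.compile(rb"/P\s*(\d+) \d+ R")
KIDS_RE = re.compile(rb"/K\s*(\[.*?\]|\d+ \d+ R)", re.DOTALL)
REF_RE = re.compile(rb"(\d+) \d+ R")

def parse_struct_elems(data):
    """Map object ID -> {'S': tag type, 'P': parent ID, 'K': child IDs}."""
    elems = {}
    for match in OBJ_RE.finditer(data):
        obj_id, body = int(match.group(1)), match.group(3)
        struct_type = TYPE_RE.search(body)
        if struct_type is None:
            continue  # no /S name, so not treated as a structure element
        parent = PARENT_RE.search(body)
        kids = KIDS_RE.search(body)
        elems[obj_id] = {
            "S": struct_type.group(1).decode("ascii"),
            "P": int(parent.group(1)) if parent else None,
            # Only object references are collected; bare integers are ignored.
            "K": [int(n) for n in REF_RE.findall(kids.group(1))] if kids else [],
        }
    return elems

# The two objects from the post, with the K array of Object 189 abbreviated.
SAMPLE = (
    b"189 0 obj\r<</A 190 0 R/K[76 0 R 191 0 R 192 0 R]/P 191 0 R/S/Table>>\rendobj\r"
    b"191 0 obj\r<</K[189 0 R 189 0 R]/P 189 0 R/S/TR>>\rendobj"
)
```

Calling parse_struct_elems(SAMPLE) yields one entry for Object 189 and one for Object 191, each listing the other as both parent and child.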
In my tree diagrams at the beginning of this post, I failed to note that each <TR> actually had two copies of the corresponding <Table> in it. Also, the <TR> was, in effect, the problem – all of the other references seen in the table above were valid, necessary table rows. Following the references, you can see that Object 189, the <Table>, has Object 191, the <TR>, as both a child (K) and a parent (P). Likewise, Object 191 has Object 189 as both a child and a parent. In my experience, the child is the more pressing matter of the two; it creates a direct path downward into the neverending pit of tables and rows.
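The downward path through K is exactly what a cycle check would look for. As a minimal sketch, assuming the child references have already been collected into a plain dictionary of object IDs (built by hand here), a depth-first walk can report the first loop it meets instead of descending forever. The function name find_cycle is hypothetical, and the walk is iterative so that a very deep tree cannot exhaust Python’s recursion limit.

```python
def find_cycle(children, root):
    """Return a list of IDs forming a loop reachable from root, or None."""
    path = [root]          # IDs on the current downward path
    on_path = {root}       # same IDs as a set, for quick membership tests
    done = set()           # IDs whose subtrees are known to be loop-free
    # Each stack entry pairs a node with an iterator over its children.
    stack = [(root, iter(children.get(root, [])))]
    while stack:
        node, kids = stack[-1]
        for kid in kids:
            if kid in on_path:
                # The child is one of its own ancestors: that is the loop.
                return path[path.index(kid):] + [kid]
            if kid not in done:
                path.append(kid)
                on_path.add(kid)
                stack.append((kid, iter(children.get(kid, []))))
                break
        else:
            # Every child has been explored without finding a loop.
            stack.pop()
            path.pop()
            on_path.discard(node)
            done.add(node)
    return None

# Child references from the post (Object 189's K array abbreviated),
# with Object 188 as the table's real parent.
children = {188: [189], 189: [76, 191, 192], 191: [189, 189]}
loop = find_cycle(children, 188)
```

Here loop comes back as [189, 191, 189]: the table, the row, and the table again.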
Safely fixing the problem
My original question here was how I could have fixed this in my hex editor of choice if I was unable to do so in Acrobat. The simplest way is to leave Object 191 intact, but remove the entire K section[2], yielding 191 0 obj␍<</P 189 0 R/S/TR>>␍endobj. This will leave an empty <TR> in place of the troublesome one. This can then be deleted from the Tags panel, but before that happens, it’s still possible to get Object 189 (the <Table>) to cause a hang – changing its P value to point at its real parent (in this instance, Object 188) will safeguard against this. Finally, the tag should be easy to find, but if one wants to make it foolproof, we can add a Title to make it readily identifiable. The keyword is T and strings are delimited in parentheses, so 191 0 obj␍<</P 189 0 R/S/TR/T(BAD!BAD!)>>␍endobj will entitle it ‘BAD!BAD!’.
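The same edit can be scripted instead of typed into a hex editor. This is a rough sketch only: it assumes an uncompressed file, a generation number of 0, a flat dictionary, and a title with no parentheses or backslashes that would need escaping. The function name defuse_object is made up for illustration.

```python
import re

# A /K entry holding either an array or a single reference.
KIDS_RE = re.compile(rb"/K\s*(?:\[.*?\]|\d+ \d+ R)", re.DOTALL)

def defuse_object(data, obj_id, title=b"BAD!BAD!"):
    """Strip the /K entry from one object and label it with a /T title."""
    pattern = re.compile(
        rb"\b(%d 0 obj\s*<<)(.*?)(>>\s*endobj)" % obj_id, re.DOTALL
    )

    def patch(match):
        # Drop the children, keep everything else, append the title.
        body = KIDS_RE.sub(b"", match.group(2), count=1)
        return match.group(1) + body + b"/T(" + title + b")" + match.group(3)

    return pattern.sub(patch, data, count=1)

before = b"191 0 obj\r<</K[189 0 R 189 0 R]/P 189 0 R/S/TR>>\rendobj"
after = defuse_object(before, 191)
```

The result in after is 191 0 obj␍<</P 189 0 R/S/TR/T(BAD!BAD!)>>␍endobj, matching the hand edit. Note that the edit changes the length of the file, so any byte offsets recorded in a cross-reference table after this object no longer line up; that may be one reason a viewer offers to repair the file afterwards.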
Can we delete the tag entirely from our hex editor? Absolutely, we just need to take a few extra precautions. We’ll delete the entirety of 191 0 obj␍<</K[189 0 R 189 0 R]/P 189 0 R/S/TR>>␍endobj. We need to have sorted out the appropriate parent object of Object 189 this time, and must set P x y R accordingly (given ID/generation as x and y). Finally, Acrobat will be a bit befuddled if we leave the nonexistent 191 0 R in Object 189’s K array, so we’ll delete this, yielding 189 0 obj␍<</A 190 0 R/K[76 0 R 192 0 R ….
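The three steps (delete the object, fix the parent, drop the dangling reference) can be sketched in one pass. As before, this assumes an uncompressed file, generation 0 throughout, and flat dictionaries; the function name excise_object is hypothetical, and the K array in the sample is abbreviated.

```python
import re

def excise_object(data, dead_id, owner_id, new_parent_id):
    """Delete object dead_id, then clean up owner_id's /K and /P entries."""
    # Step 1: remove the dead object itself.
    dead = re.compile(rb"\b%d 0 obj.*?endobj\s*" % dead_id, re.DOTALL)
    data = dead.sub(b"", data, count=1)

    owner = re.compile(
        rb"\b(%d 0 obj\s*<<)(.*?)(>>\s*endobj)" % owner_id, re.DOTALL
    )

    def patch(match):
        body = match.group(2)
        # Step 2: point /P at the real parent.
        body = re.sub(
            rb"/P\s*\d+ \d+ R", b"/P %d 0 R" % new_parent_id, body, count=1
        )
        # Step 3: drop every remaining reference to the deleted object.
        body = re.sub(rb"\b%d 0 R ?" % dead_id, b"", body)
        return match.group(1) + body + match.group(3)

    return owner.sub(patch, data, count=1)

sample = (
    b"189 0 obj\r<</A 190 0 R/K[76 0 R 191 0 R 192 0 R]/P 191 0 R/S/Table>>\rendobj\r"
    b"191 0 obj\r<</K[189 0 R 189 0 R]/P 189 0 R/S/TR>>\rendobj"
)
fixed = excise_object(sample, dead_id=191, owner_id=189, new_parent_id=188)
```

After the call, fixed contains only Object 189, with 191 0 R gone from its K array and P pointing at Object 188.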
Either of the above strategies will likely lead to Acrobat thinking it needs to fix the PDF[3], but that’s just housekeeping and subsequent saves should patch it right up.
Final questions and thoughts
While the point of this post was largely to illustrate how one can manually edit the raw bits and bobs of a PDF to fix unique problems, two questions remain: where did this infinite loop come from, and why does it matter? The former has an unsatisfactory answer: I don’t know. I exported the PDF myself from a PowerPoint (.pptx) file that I was given. From inside PowerPoint itself, nothing seemed irregular about the table in question. Examining Office XML files is never a clean process, but my cursory glance didn’t see any odd recursion. Regardless, it’s distressing that the Acrobat export plugin for Office would render out such a structure.
As to why it matters, well, for many users it won’t. Tags aren’t really a necessary consideration for a straightforward visual presentation of a PDF[4]. Acrobat (like most viewers I’ve dealt with) doesn’t bother parsing the tags unless/until it has to. So, a sighted user opening this file up in Acrobat Reader, scrolling through it, and closing it when done would be entirely oblivious. Nothing about that scenario would trigger crash conditions. However, opening the document with a screen reader causes an immediate crash, as do most attempts to export the file to another format.
A document that crashes when presented with a screen reader is obviously a problem for accessibility. Unfortunately, it’s one that an amateur pass through the document would likely miss. Acrobat’s accessibility checker doesn’t dive deep enough to crash, nor does it manage to find the fatal flaw. I have long believed that accessibility checkers in software provide a false sense of accessibility and generally do more harm than good for this reason. While this particular issue is (hopefully) quite rare, it does reinforce my stance.
[1] I previously wrote about making bomb-style SVGs.
[2] We can also break the loop by making Object 191’s two children something, anything else. Referencing a nonexistent object is a bit weird, but it works enough to buy us time.
[3] I’m not entirely sure how it determines this. The only references to checksumming I’ve seen in the spec are specific to embedded objects, and I haven’t found any MD5 hashes in files I’ve worked on that Acrobat has felt the need to ‘fix’. It may very well be that there’s just a missing reference somewhere now. Regardless, I’ve never had this manifest as an actual problem.
[4] I was curious about how Firefox would handle the file, as it sort of recreates a PDF in HTML. It doesn’t, however, do this very accurately, it seems. Accordingly, it had no problem rendering the table (which it didn’t even render as a table).