Didier Stevens

Wednesday 9 April 2008

Quickpost: About the Physical and Logical Structure of PDF Files

Filed under: PDF,Quickpost — Didier Stevens @ 6:57

Here is a post to explain in detail PDF polymorphism mentioned in my BH post.

This is a simple “Hello World”-PDF viewed with a text editor:

It is composed of:

  • a header
  • a list of objects
  • a cross reference table
  • a trailer

What I describe here is the physical structure of a PDF file. The header identifies that this is a PDF file (specifying the PDF file format version), the trailer points to the cross reference table (starting at byte position 642 into the file), and the cross reference table points to each object (1 to 7) in the file (byte positions 12 through 518). The objects are ordered in the file: 1, 2, 3, 4, 5, 6 and 7.

The logical structure of a PDF file is an hierarchical structure, the root object is identified in the trailer. Object 1 is the root, object 2 and 3 are children of object 1, etc…, giving this logical structure:

The physical structure of a PDF file can be transformed into another physical structure, without changing the logical structure. Here is the same file, but now the objects are ordered from 7 to 1 (I reversed the order in which the objects appear in the file):

I also had to update the cross reference table, because each object is located at a different position now. But apart from that, nothing has changed. The root is still object 1, and the tree is the same. In other words, the logical structure of the file remained unchanged, which implies that the rendering of both PDF files is identical. Objects can appear at random positions in a PDF file without impact on the logical file structure (i.e. rendering). For this simple file, with 7 objects, I have 5020 5040 (that’s 7!) possible physical structures, just by reordering the objects. And reordering objects is just one way to mutate the physical structure of a PDF file.

You can download both PDF files here.

Quickpost info


Blog at WordPress.com.