Quickpost: About the Physical and Logical Structure of PDF Files

Wednesday 9 April 2008

Filed under: PDF,Quickpost — Didier Stevens @ 6:57

This post explains in detail the PDF polymorphism I mentioned in my BH post.

This is a simple “Hello World”-PDF viewed with a text editor:

It is composed of:

a header
a list of objects
a cross reference table
a trailer

What I describe here is the physical structure of a PDF file. The header identifies that this is a PDF file (specifying the PDF file format version), the trailer points to the cross reference table (starting at byte position 642 into the file), and the cross reference table points to each object (1 to 7) in the file (byte positions 12 through 518). The objects are ordered in the file: 1, 2, 3, 4, 5, 6 and 7.
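
The four parts described above can be sketched in a few lines of Python. This is a minimal illustration, not the file from the post: the object bodies in `OBJECTS` are a hypothetical seven-object "Hello World" layout, and `build_pdf` and `read_xref` are names invented here. Because the bytes differ from the original file, the offsets it computes will not match the 12, 518 and 642 quoted above.

```python
import re

# Hypothetical object bodies (assumption): a plausible seven-object layout,
# not a copy of the file shown in the post.
CONTENT = b"BT /F1 24 Tf 100 700 Td (Hello World) Tj ET"
OBJECTS = {
    1: b"<< /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >>",
    2: b"<< /Type /Outlines /Count 0 >>",
    3: b"<< /Type /Pages /Kids [4 0 R] /Count 1 >>",
    4: b"<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] /Contents 5 0 R"
       b" /Resources << /ProcSet 6 0 R /Font << /F1 7 0 R >> >> >>",
    5: b"<< /Length %d >>\nstream\n" % len(CONTENT) + CONTENT + b"\nendstream",
    6: b"[/PDF /Text]",
    7: b"<< /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Helvetica >>",
}

def build_pdf(objects):
    """Write header, objects, cross reference table and trailer, in that order."""
    out = bytearray(b"%PDF-1.1\n")            # header: states the format version
    offsets = {}
    for num in sorted(objects):
        offsets[num] = len(out)               # byte position where the object starts
        out += b"%d 0 obj\n" % num + objects[num] + b"\nendobj\n"
    xref_pos = len(out)                       # byte position of the cross reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"           # entry 0: head of the free list
    for num in sorted(objects):
        out += b"%010d 00000 n \n" % offsets[num]
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

def read_xref(pdf):
    """Follow the chain: trailer -> cross reference table -> object offsets."""
    xref_pos = int(re.search(rb"startxref\s+(\d+)\s+%%EOF", pdf).group(1))
    entries = re.findall(rb"(\d{10}) \d{5} n", pdf[xref_pos:])
    return xref_pos, {i + 1: int(off) for i, off in enumerate(entries)}

pdf = build_pdf(OBJECTS)
xref_pos, offsets = read_xref(pdf)
for num, off in offsets.items():
    # Each cross reference entry points at the "N 0 obj" line of its object.
    assert pdf[off:].startswith(b"%d 0 obj" % num)
    print("object", num, "starts at byte", off)
print("cross reference table starts at byte", xref_pos)
```

The sketch assumes objects are numbered contiguously from 1 and all have generation 0, which keeps the cross reference table to a single subsection.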

The logical structure of a PDF file is a hierarchical structure; the root object is identified in the trailer. Object 1 is the root, objects 2 and 3 are children of object 1, and so on, giving this logical structure:
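
As a rough sketch of how such a tree can be recovered, the Python below follows indirect references (`N 0 R`) starting from the root. Only "object 1 is the root with children 2 and 3" comes from the text above; the remaining object bodies in `OBJECTS` are assumptions made for illustration, and `logical_tree` and `show` are hypothetical helper names.

```python
import re

# Hypothetical object bodies (assumption); the content stream of object 5 is omitted.
OBJECTS = {
    1: b"<< /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >>",
    2: b"<< /Type /Outlines /Count 0 >>",
    3: b"<< /Type /Pages /Kids [4 0 R] /Count 1 >>",
    4: b"<< /Type /Page /Parent 3 0 R /Contents 5 0 R"
       b" /Resources << /ProcSet 6 0 R /Font << /F1 7 0 R >> >> >>",
    5: b"<< /Length 43 >>",
    6: b"[/PDF /Text]",
    7: b"<< /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Helvetica >>",
}

def logical_tree(objects, root):
    """Return {object number: [children]} for everything reachable from root."""
    tree, seen, todo = {}, {root}, [root]
    while todo:
        num = todo.pop(0)
        refs = [int(n) for n in re.findall(rb"(\d+) 0 R", objects[num])]
        # Skip references to objects already placed in the tree (e.g. /Parent).
        tree[num] = [r for r in dict.fromkeys(refs) if r not in seen]
        seen.update(tree[num])
        todo.extend(tree[num])
    return tree

def show(tree, num, depth=0):
    """Render the tree as indented lines, one object per line."""
    lines = ["  " * depth + "object %d" % num]
    for child in tree[num]:
        lines += show(tree, child, depth + 1)
    return lines

# The root number would normally be read from /Root in the trailer.
print("\n".join(show(logical_tree(OBJECTS, 1), 1)))
```

The walk never looks at byte positions: it depends only on which object refers to which.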

The physical structure of a PDF file can be transformed into another physical structure, without changing the logical structure. Here is the same file, but now the objects are ordered from 7 to 1 (I reversed the order in which the objects appear in the file):

I also had to update the cross reference table, because each object is located at a different position now. But apart from that, nothing has changed. The root is still object 1, and the tree is the same. In other words, the logical structure of the file remained unchanged, which implies that the rendering of both PDF files is identical. Objects can appear at random positions in a PDF file without impact on the logical file structure (i.e. rendering). For this simple file, with 7 objects, I have ~~5020~~ 5040 (that’s 7!) possible physical structures, just by reordering the objects. And reordering objects is just one way to mutate the physical structure of a PDF file.
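
The reordering can be sketched as follows. This is an illustration under assumptions, not the procedure used to produce the downloadable files: `OBJECTS` is a hypothetical seven-object layout, and `serialize` and `parse` are names invented here. The sketch writes the same objects in two different orders, rebuilds the cross reference table each time, and checks that reading the objects back through that table gives the same result.

```python
import math
import re

# Hypothetical object bodies (assumption), not the file from the post.
CONTENT = b"BT /F1 24 Tf 100 700 Td (Hello World) Tj ET"
OBJECTS = {
    1: b"<< /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >>",
    2: b"<< /Type /Outlines /Count 0 >>",
    3: b"<< /Type /Pages /Kids [4 0 R] /Count 1 >>",
    4: b"<< /Type /Page /Parent 3 0 R /Contents 5 0 R"
       b" /Resources << /ProcSet 6 0 R /Font << /F1 7 0 R >> >> >>",
    5: b"<< /Length %d >>\nstream\n" % len(CONTENT) + CONTENT + b"\nendstream",
    6: b"[/PDF /Text]",
    7: b"<< /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Helvetica >>",
}

def serialize(objects, order):
    """Write the objects in the given physical order and rebuild the xref table."""
    out = bytearray(b"%PDF-1.1\n")
    offsets = {}
    for num in order:
        offsets[num] = len(out)               # new byte position of this object
        out += b"%d 0 obj\n" % num + objects[num] + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
    for num in sorted(objects):               # table stays in object-number order
        out += b"%010d 00000 n \n" % offsets[num]
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

def parse(pdf):
    """Recover {object number: body} by following the cross reference table."""
    xref_pos = int(re.search(rb"startxref\n(\d+)\n%%EOF", pdf).group(1))
    offsets = [int(o) for o in re.findall(rb"(\d{10}) \d{5} n", pdf[xref_pos:])]
    obj = re.compile(rb"(\d+) 0 obj\n(.*?)\nendobj", re.DOTALL)
    found = {}
    for off in offsets:
        match = obj.match(pdf, off)
        found[int(match.group(1))] = match.group(2)
    return found

forward = serialize(OBJECTS, [1, 2, 3, 4, 5, 6, 7])
reverse = serialize(OBJECTS, [7, 6, 5, 4, 3, 2, 1])
assert forward != reverse                          # different physical structure
assert parse(forward) == parse(reverse) == OBJECTS # same objects, same references
print(math.factorial(7))                           # orderings of 7 objects: 5040
```

Only the object order and the offsets in the cross reference table differ between the two outputs; the object bodies, and therefore the references between them, are byte-for-byte the same.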

You can download both PDF files here.

30 Comments »

[…] add the URI action object and the OpenAction event to the hello world PDF file I used in a previous post, to build our test PDF. You can download all examples here. Opening the test PDF document launches […]

Pingback by PDF, Let Me Count the Ways… « Didier Stevens — Tuesday 29 April 2008 @ 6:22
[…] indirect object is all I have to include in my basic PDF document to get a PoC PDF document to crash Adobe Acrobat Reader […]

Pingback by Quickpost: /JBIG2Decode Essentials « Didier Stevens — Monday 2 March 2009 @ 23:12
hello,

thanks for the nice description of the pdf format; one question: how to insert some text that is positioned at some angle relative to the horizontal; for example the entire text-box should be at 45 degrees …

Comment by iovanalex — Tuesday 31 March 2009 @ 10:04
I have no idea, you’ll have to look that up in the PDF reference document. I don’t have PDF expertise, only malicious PDF expertise 😉

Comment by Didier Stevens — Tuesday 31 March 2009 @ 10:44
thanks,
what do you mean by “pdf reference document” ? do you have some links ?

Comment by iovanalex — Tuesday 31 March 2009 @ 17:10
http://tinyurl.com/c2c7sy 😉

Comment by Didier Stevens — Tuesday 31 March 2009 @ 17:21
[…] Malformed PDF Documents Filed under: Malware, My Software, PDF — Didier Stevens @ 7:55 For the sake of this post, I consider a PDF document malformed when it doesn’t observe the basic structure of a PDF document. […]

Pingback by Malformed PDF Documents « Didier Stevens — Thursday 14 May 2009 @ 7:55
I am designing a tool that would extract all comment-related information from a PDF file, such as the creator of the comment, the date and the note.
Can anyone help me with how I can extract the comments from a PDF file?

Comment by saurav — Thursday 14 May 2009 @ 17:35
I guess you mean meta-data, the thing you see in the properties dialog of a PDF document? And not the comments reviewers add to a PDF document?

Comment by Didier Stevens — Thursday 14 May 2009 @ 19:15
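[Security] Looking into the vulnerabilities exploited by the (tentative) Gumblar variant of October 2009…

The (tentative) Gumblar variant observed in late October 2009 has been confirmed to exploit the following four vulnerabilities: an Adobe Reader vulnerability, an Adobe Flash Player vulnerability, the Microsoft Office Web Components vulnerability (MS09-043), and the Internet Explorer 7 vulnerability (MS09-002) …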

Trackback by 思い立ったら書く日記 — Sunday 25 October 2009 @ 2:34
[…] objects, each with a unique numeric identifier. (For more information, see here) The first tool we will look at is pdfid. How it works is very simple, just like […]

Pingback by Les outils d’analyse de PDF « Elevenses blog — Monday 10 May 2010 @ 14:54
On my calculator 7! is 5040, not 5020…

Comment by Oxygenator — Wednesday 29 December 2010 @ 3:15
@Oxygenator I suspect you know 7! out of your head, that you don’t need a calculator 😉

Comment by Didier Stevens — Wednesday 29 December 2010 @ 14:27
[…] each with a unique numeric identifier. (For more information, see here). The first tool we will look at is pdfid. How it works is very simple, just like […]

Pingback by Secur-IT — Thursday 6 January 2011 @ 13:56
When I opened “hello-world.pdf”, the text displayed ok, and everything was fine except for an error message :
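“cette page contient une erreur. Acrobat risque de ne pas afficher cette page correctement. Contactez l’auteur du document PDF pour résoudre le problème.”
(which means: “This page contains an error. Acrobat may not display this page correctly. Contact the author of the PDF document to resolve the problem.”)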
(I use adobe reader 9).

Do you know what’s missing ? Do you think you’ll post an updated “hello world” ?

Thanks for your very interesting post.

Comment by tintin — Tuesday 22 February 2011 @ 9:24
@tintin I noticed a small error and fixed it. The length in object 5 should be 48 and not 67.

Comment by Didier Stevens — Wednesday 23 February 2011 @ 12:24
I tried crafting a basic document by hand as you explained, but it only shows up a blank page.
I tested the document with pdfXchange viewer, Document Viewer 2.32.0 (Ubuntu) and Adobe ReaderX.
I tried to create it using vim and notepad++.
Do I have to use a special encoding or something? Could you please give me a hint.
thx

Comment by shredit — Monday 14 March 2011 @ 12:24
@shredit Did you download my demo PDFs?

Comment by Didier Stevens — Thursday 17 March 2011 @ 15:18
[…] it so that I could gain a better understanding of the PDF document. The example PDF is taken from a simpler explanation by Didier Stevens. The rest of the details are filled in by the Adobe PDF Specification. I must admit that much of […]

Pingback by Anatomy of a PDF document | amccormack.net — Sunday 22 January 2012 @ 13:42
[…] v0.3.9 (Download) This tool will parse a PDF document to identify the fundamental elements used in the analyzed […]

Pingback by IT Vulnerability & ToolsWatch | PDF Tools (Black Hat EU 2012 Edition) Released — Friday 16 March 2012 @ 13:01
I want to know: will the logical structure of many PDF files be the same? Can we trust the hierarchical way of finding malware in PDFs?

Comment by n2 — Sunday 30 March 2014 @ 1:51
@n2
1) no
2) please elaborate, what is “the hierarchical way of finding malware in PDFs”?

Comment by Didier Stevens — Sunday 30 March 2014 @ 10:04
I mean, is relying heavily on the logical structure of a PDF file trustworthy?
I also want to know: of the 1024 bytes allocated for the header, how many bytes does storing the version number take? One of the papers I read says that the version number can be anywhere in this header part, so this place can be used for data hiding. Is this true?

Comment by neethu lakshmi — Monday 31 March 2014 @ 1:09
The paper I read says that the logical structures of PDF documents can be alike, and so they are a good way to find malware with the help of link counts. This is the paper I read: “Detection of Malicious PDF Files Based on Hierarchical Document Structure”. I have these doubts because I’m doing a project on malware analysis of documents and developing a tool.

Comment by neethu lakshmi — Monday 31 March 2014 @ 1:12
[…] For a quick overview of the structure of a PDF file, there is another article by Didier Stevens titled “About the Physical and Logical Structure of PDF Files”. […]

Pingback by Análisis de PDF sospechosos | Security Art Work — Monday 31 March 2014 @ 14:09
@Neethu I skimmed through that paper. It looks like they have another definition for the logical structure. Are you a CS student?

Comment by Didier Stevens — Friday 4 April 2014 @ 15:45
[…] Physical and Logical Structure of PDF Files […]

Pingback by Analyzing Malicious Documents Cheat Sheet — Tuesday 27 January 2015 @ 18:12
[…] you are like me, then you probably need to read up on how PDFs are actually structured. This article, written by Didier Stevens, describes the basic structure of a PDF file. In principle the PDF file […]

Pingback by Flare-On – Challenge 4 | 0x44696f21 – A Techy Journey — Sunday 19 April 2015 @ 11:08
[…] Before we start analyzing the sample, it will be useful to gain a high-level understanding of the PDF file format specification. The file format description below is intended to provide a brief overview/refresher. If needed, please reference the Adobe PDF file format specification for a more thorough understanding. Also, Didier Stevens provides an extremely good description of the Physical and Logical Structure of PDF Files. […]

Pingback by A Guided Tour of a Classic PDF-Based Malware Dropper – Kevin Douglas — Tuesday 28 February 2017 @ 1:49
[…] Quickpost: About the Physical and Logical Structure of PDF Files […]

Pingback by Checking for maliciousness in Acroform objects on PDF files – Furoner.CAT — Wednesday 15 November 2017 @ 15:22
