PDF, Let Me Count the Ways…

Tuesday 29 April 2008

Filed under: Malware,PDF — Didier Stevens @ 6:21

In this post, I show how basic features of the PDF language can be used to generate polymorphic variants of (malicious) PDF documents. If you code a PDF parser, write signatures (AV, IDS, …) or analyze (malicious) PDF documents, you should be aware of these features.

Official language specifications are interesting documents, and I used to read them from front to back. I especially appreciate the inclusion of a formal language description, for example in Backus–Naur form. But nowadays, I don’t take the time to do this anymore.

While browsing through the official PDF documentation, I took particular interest in the rules to express lexemes. There are many ways to write the same token, offering opportunities to evade known-pattern recognition systems, like AV and NIDS.

Building a test file

Before I show some examples, let’s build a test PDF file that will start the default browser and navigate to a site each time the document is opened.

Opening a web page from a PDF file can be done with a URI action object.

This is the same type of object used in the malicious mailto PDF files.

An action must be triggered by an event; examples of such triggers are the display of a page or the opening of the PDF document. We will use OpenAction to trigger our URI action object each time our test PDF document is opened.

I add the URI action object and the OpenAction event to the hello world PDF file I used in a previous post, to build our test PDF. You can download all examples here. Opening the test PDF document launches IE.
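As a rough illustration, the Python sketch below assembles a minimal PDF with a URI action wired to OpenAction. It is not the hello world file from the earlier post: the object numbering, the blank page, the function name `build_test_pdf` and the placeholder URL https://example.com/ are all assumptions made for the example.

```python
def build_test_pdf(url: str = "https://example.com/") -> bytes:
    """Build a one-page PDF whose OpenAction is a URI action (illustrative only)."""
    # Assumption: the URL contains no parentheses or backslashes, so it can be
    # placed in a literal string without escaping.
    objects = [
        # 1: document catalog; /OpenAction refers to the URI action (object 4)
        b"<< /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >>",
        # 2: page tree with a single page
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        # 3: one blank page
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
        # 4: the URI action itself
        b"<< /Type /Action /S /URI /URI (" + url.encode("ascii") + b") >>",
    ]

    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset needed by the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the free object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    with open("test-uri.pdf", "wb") as handle:
        handle.write(build_test_pdf())
```

Whether a given reader follows the action, and whether it asks for confirmation first, depends on the reader and its version.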

Now that we have our test PDF, let’s look at the ways we can change its representation without changing its rendering. This is what I’m covering (this list is not exhaustive):

Names
- Hexadecimal encoding
Strings
- Newline escaping
- Octal encoding
- Hexadecimal encoding
- Hexadecimal whitespacing
- Encryption

Name representation

The tokens preceded by a / (slash) in the URI action object are called Names in the official PDF description. Names are case-sensitive. The characters used in a Name are limited to a specific set, but since version 1.2 of the PDF specification, a lexical convention allows a character to be represented by its hexadecimal ANSI code, written as #XX.

This allows us to rewrite the /URI name in several ways, for example /#55RI.

Or /#55#52#49.

Pattern matching algorithms must take into account these different representations to successfully match a pattern. A standard way to deal with this is canonicalization. First, the token is reduced to a canonical form (e.g. replace all #xx representations by the character they stand for), and second, pattern matching is performed on the canonical form.
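The canonicalization step can be sketched in a few lines of Python. The function names `canonicalize_name` and `obfuscate_name` are made up for this example, and the code handles a single name token only, not a whole PDF file.

```python
import re

# A #XX escape: a number sign followed by exactly two hexadecimal digits.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def canonicalize_name(name: bytes) -> bytes:
    """Replace every #XX escape in a name token by the byte it stands for."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), name)


def obfuscate_name(name: bytes) -> bytes:
    """Write every character after the leading slash as a #XX escape."""
    return b"/" + b"".join(b"#%02X" % c for c in name[1:])


print(canonicalize_name(b"/#55RI"))      # b'/URI'
print(canonicalize_name(b"/#55#52#49"))  # b'/URI'
print(obfuscate_name(b"/URI"))           # b'/#55#52#49'
```

Pattern matching then runs on the output of `canonicalize_name`, so one signature for /URI covers all the encoded variants.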

String representation

Strings too can be represented in many forms. One way to represent a string is to type the text between parentheses.

A string can be split over several lines by adding a backslash (\) at the end of each line.

Of course, we are not limited in the number of lines; we can add a backslash and a line break after each character.

A character in a string can also be represented by its octal code, written as a backslash followed by up to three octal digits (for example, \125 for U).

And this can be done for every character in the string.
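To make these literal string forms concrete, here is a small Python sketch that decodes the body of a literal string (the part between the outer parentheses) and re-encodes a string entirely in octal. The names `decode_literal_string` and `to_octal` are hypothetical, and the decoder is simplified: it does not normalize unescaped line breaks inside the string.

```python
# Single-character escapes and the byte values they stand for.
_SIMPLE = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("b"): 0x08,
    ord("f"): 0x0C, ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}
_OCTAL = b"01234567"


def decode_literal_string(body: bytes) -> bytes:
    """Decode backslash escapes in the body of a PDF literal string."""
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != 0x5C:  # ordinary character, copy as-is
            out.append(c)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break
        c = body[i]
        if c in _SIMPLE:
            out.append(_SIMPLE[c])
            i += 1
        elif c in _OCTAL:  # \d, \dd or \ddd
            j = i
            while j < len(body) and j - i < 3 and body[j] in _OCTAL:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif c in b"\r\n":  # line continuation: drop backslash and line break
            i += 2 if body[i:i + 2] == b"\r\n" else 1
        else:  # unknown escape: the backslash is ignored
            out.append(c)
            i += 1
    return bytes(out)


def to_octal(data: bytes) -> bytes:
    """Represent every byte by a three-digit octal escape."""
    return b"".join(b"\\%03o" % c for c in data)


print(decode_literal_string(b"UR\\\nI"))  # b'URI'
print(to_octal(b"URI"))                   # b'\\125\\122\\111'
```

A scanner that decodes string bodies this way before matching sees the same bytes regardless of line splitting or octal encoding.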

One more way to represent a string is hexadecimal: the hex digits are written between angle brackets (< and >).

You’re allowed to put whitespace between the hex digits.

And you’re not limited in the amount of whitespace you use.
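A hedged Python sketch of the hexadecimal form follows. The names `decode_hex_string` and `encode_hex_string` and the `gap` parameter are invented for the example; the whitespace set is the one the PDF language treats as whitespace (NUL, tab, line feed, form feed, carriage return, space).

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def decode_hex_string(token: bytes) -> bytes:
    """Decode a hexadecimal string token such as b'<55 52 49>'."""
    if not (token.startswith(b"<") and token.endswith(b">")):
        raise ValueError("not a hexadecimal string token")
    # Whitespace between the digits carries no meaning, so drop all of it.
    digits = bytes(c for c in token[1:-1] if c not in PDF_WHITESPACE)
    if len(digits) % 2:
        digits += b"0"  # a missing final digit is assumed to be 0
    return bytes.fromhex(digits.decode("ascii"))


def encode_hex_string(data: bytes, gap: bytes = b"") -> bytes:
    """Encode data as a hexadecimal string, with `gap` between the bytes."""
    return b"<" + gap.join(b"%02x" % c for c in data) + b">"


print(decode_hex_string(b"<555249>"))           # b'URI'
print(decode_hex_string(b"<55 52\n\n    49>"))  # b'URI'
print(encode_hex_string(b"URI", gap=b" \n"))    # b'<55 \n52 \n49>'
```

Because `gap` can be any run of whitespace, the number of byte-level variants of one string is effectively unbounded, which is why matching has to happen after decoding.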

This whitespace usage reminds me of the IE zero-byte trick in HTML.

I want to finish this long list of examples with PDF encryption, one more way to change the representation of a PDF document. PDFs can be encrypted without requiring the user to provide a password to view the document; this form of encryption is used for DRM. Ever had a PDF with printing or text copy disabled? That’s an encrypted PDF.

When a PDF is encrypted, only the strings and streams are encrypted; the objects themselves are not. Encrypted strings are thus one more way to change the representation of a string.
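As a sketch of what string encryption does at the byte level, the Python below applies RC4 with a per-object key, following the key derivation the PDF specification describes for RC4 encryption (MD5 over the file key plus the object and generation numbers). The file key itself is normally computed from the encryption dictionary; here it is simply assumed as an input, and the names `rc4`, `object_key` and `encrypt_string` are made up for the example.

```python
import hashlib


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4; encryption and decryption are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key scheduling
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:  # keystream generation, XORed with the data
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def object_key(file_key: bytes, obj_num: int, gen_num: int) -> bytes:
    """Derive the RC4 key for one object from the (assumed) file key."""
    material = (
        file_key
        + (obj_num & 0xFFFFFF).to_bytes(3, "little")  # low 3 bytes of object number
        + (gen_num & 0xFFFF).to_bytes(2, "little")    # low 2 bytes of generation
    )
    return hashlib.md5(material).digest()[: min(len(file_key) + 5, 16)]


def encrypt_string(file_key: bytes, obj_num: int, gen_num: int, text: bytes) -> bytes:
    """Return the encrypted string as a hexadecimal string token."""
    cipher = rc4(object_key(file_key, obj_num, gen_num), text)
    return b"<" + cipher.hex().encode("ascii") + b">"


# Hypothetical 40-bit file key, used only to show the mechanics.
demo_key = bytes([1, 2, 3, 4, 5])
print(encrypt_string(demo_key, 7, 0, b"https://example.com/"))
```

The same URI placed in a different object, or under a different file key, yields different ciphertext, so a byte signature on the string cannot match without decrypting first.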

I know that PDF encryption has already been used to mislead spam filters.

Final thoughts

The many features of the PDF language that provide flexibility in the representation of names and strings can also be used to generate polymorphic forms of the same malicious PDF. If you need to scan PDF documents, you need to be aware of all these features and have tools that support them.

There are indications that most AV products don’t canonicalize PDF documents prior to signature matching. I did some tests with a malicious mailto PDF document, and changing the string representation of the mailto URI action to the hexadecimal form was enough to evade AV detection. Adding whitespace wasn’t necessary. The ClamAV source code for PDF documents has more evidence of PDF canonicalization issues in AV software: it contains a string compare for the Length name without canonicalization.

This will not match if hex codes are used (#).
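The effect is easy to reproduce in Python. This is not the ClamAV code, only an illustration of the same kind of comparison; `canonicalize_names` is a made-up helper, and it is simplified in that it also rewrites slash-prefixed runs inside strings and streams, which a real parser would skip.

```python
import re

# A name token: a slash followed by characters that are neither whitespace
# nor delimiters.
_NAME = re.compile(rb"/[^\x00\t\n\x0c\r ()<>\[\]{}/%]*")
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def canonicalize_names(data: bytes) -> bytes:
    """Decode #XX escapes inside every name token found in data."""
    def decode(match):
        return _HEX_ESCAPE.sub(
            lambda h: bytes([int(h.group(1), 16)]), match.group(0)
        )
    return _NAME.sub(decode, data)


sample = b"<< /L#65ngth 42 /Filter /Fl#61teDecode >>"

print(b"/Length" in sample)                      # False: the plain compare misses it
print(b"/Length" in canonicalize_names(sample))  # True
```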

I tested all my examples with Adobe Acrobat Reader 8.1.2 and Foxit Reader 2.2 without problems. But Foxit Reader 2.2 gave me an unpleasant surprise; more on this in a later post.

I wonder if malicious PDF samples will be used in the Race to Zero.

19 Comments »

That’s nice, it’s very useful for me, now i’m coding a pdf exploit detector in C++ and this info is greatelly helpfull if you have any other info please send it to my email.
Thanks for the post is brillant!

BR,
Ariel.

Comment by Ariel Liguori — Tuesday 25 November 2008 @ 17:31
Any chance you have a series of benign PDF’s that demonstrate the different types of vulnerabilities you’ve seen malware use (or for that mater even malware infested PDF’s).

Like Ariel, I’m writing something to hopefully help detect this stuff. In my case I’m writing it in PHP to be used with a website that is going to let users upload PDF’s.

Thanks,
Doug

Comment by Doug — Saturday 14 August 2010 @ 18:53
