In this post, I show how basic features of the PDF language can be used to generate polymorphic variants of (malicious) PDF documents. If you code a PDF parser, write signatures (AV, IDS, …) or analyze (malicious) PDF documents, you should be aware of these features.
Official language specifications are interesting documents; I used to read them from front to back. I especially appreciate the inclusion of a formal language description, for example in Backus–Naur form. But nowadays, I don’t take the time to do this anymore.
While browsing through the official PDF documentation, I took particular interest in the rules to express lexemes. There are many ways to write the same token, offering opportunities to evade known-pattern recognition systems, like AV and NIDS.
Building a test file
Before I show some examples, let’s build a test PDF file that will start the default browser and navigate to a site each time the document is opened.
Opening a web page from a PDF file can be done with a URI action, like this:
This is the same type of object used in the malicious mailto PDF files.
An action must be triggered by an event. Examples of such triggers are the display of a page or the opening of the PDF document. We will use the OpenAction to trigger our URI action object each time our test PDF document is opened:
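As a rough sketch of what these two objects look like, the Python below emits a URI action and a document catalog whose OpenAction entry points at it. The object numbers (1, 2, 7), the helper names and the example.com URL are assumptions for illustration, not values from the test file; the URL is also assumed to contain no parentheses or backslashes that would need escaping.

```python
def uri_action(obj_num: int, url: str) -> str:
    """Return the text of an indirect object holding a URI action."""
    return (
        f"{obj_num} 0 obj\n"
        f"<< /Type /Action /S /URI /URI ({url}) >>\n"
        "endobj\n"
    )


def catalog_with_open_action(obj_num: int, pages_num: int, action_num: int) -> str:
    """Return a catalog object whose /OpenAction references the action object."""
    return (
        f"{obj_num} 0 obj\n"
        f"<< /Type /Catalog /Pages {pages_num} 0 R "
        f"/OpenAction {action_num} 0 R >>\n"
        "endobj\n"
    )


# Hypothetical numbering: catalog is object 1, page tree 2, action 7.
print(uri_action(7, "http://example.com/"))
print(catalog_with_open_action(1, 2, 7))
```

A complete PDF also needs a header, a page tree, a cross-reference table and a trailer, which this sketch leaves out.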
I add the URI action object and the OpenAction event to the hello world PDF file I used in a previous post, to build our test PDF. You can download all examples here. Opening the test PDF document launches IE:
Now that we have our test PDF, let’s look at the ways we can change its representation without changing its rendering. This is what I’m covering (this list is not exhaustive):
- Hexadecimal encoding of names
- Newline escaping in strings
- Octal encoding in strings
- Hexadecimal encoding of strings
- Whitespace in hexadecimal strings
- Encryption
The tokens preceded by a / (slash) in the URI action object are called Names in the official PDF description. Names are case-sensitive. The characters used in a Name are limited to a specific set, but since version 1.2 of the PDF specification, a lexical convention allows any character to be represented by its two-digit hexadecimal code, like this: #XX.
This allows us to rewrite the /URI name in several ways, for example: /#55RI (0x55 is the code for U).
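To get a feel for how many spellings a single name has, here is a minimal Python sketch that enumerates every variant in which each character is written either literally or as #XX. The function name `name_variants` is made up for this example, and it assumes a plain ASCII name.

```python
from itertools import product


def name_variants(name: str):
    """Yield each spelling of a PDF name: every character literal or as #XX."""
    choices = [(ch, f"#{ord(ch):02X}") for ch in name]
    for combo in product(*choices):
        yield "/" + "".join(combo)


variants = list(name_variants("URI"))
for v in variants:
    print(v)
```

A name of n characters has 2^n such spellings, and twice as many per letter digit if lowercase hex digits are counted too, so listing every variant in a signature quickly becomes impractical.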
Pattern matching algorithms must take into account these different representations to successfully match a pattern. A standard way to deal with this is canonicalization. First, the token is reduced to a canonical form (e.g. replace all #xx representations by the character they stand for), and second, pattern matching is performed on the canonical form.
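The canonicalization step for names can be sketched in a few lines of Python. This is only an illustration under the assumption that the name token has already been isolated by a tokenizer; `canonicalize_name` is a hypothetical helper, not part of any existing tool.

```python
import re

# Two hex digits after a number sign, upper or lower case.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def canonicalize_name(token: bytes) -> bytes:
    """Replace every #XX sequence in a PDF name by the byte it stands for."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), token)


print(canonicalize_name(b"/#55RI"))
print(canonicalize_name(b"/#55#52#49"))
```

Because the substitution is a single pass, a decoded number sign (#23) is not interpreted again as the start of a new escape.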
Strings too can be represented in many forms. One way to represent a string is to type the text between parentheses:
Splitting strings over several lines can be done by adding a backslash (\) at the end of each line:
Of course, we are not limited in the number of lines; we can add a backslash after each character:
A character in a string can be represented by its octal code, like this:
And this can be done for every character in the string:
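A scanner that wants to see through line continuations and octal codes has to decode the string before matching. The sketch below is one possible decoder, assuming the bytes between the outer parentheses have already been extracted; the name `decode_literal_string` is invented for this example, and it does not normalize unescaped end-of-line markers.

```python
def decode_literal_string(body: bytes) -> bytes:
    """Decode the content between the outer parentheses of a literal string."""
    simple = {
        ord("n"): b"\n", ord("r"): b"\r", ord("t"): b"\t",
        ord("b"): b"\b", ord("f"): b"\f",
        ord("("): b"(", ord(")"): b")", ord("\\"): b"\\",
    }
    out = bytearray()
    i, n = 0, len(body)
    while i < n:
        c = body[i]
        if c != 0x5C:                      # not a backslash: copy as-is
            out.append(c)
            i += 1
            continue
        i += 1                             # skip the backslash
        if i >= n:
            break
        c = body[i]
        if c in b"01234567":               # octal code, one to three digits
            j = i
            while j < n and j < i + 3 and body[j] in b"01234567":
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif c == 0x0D:                    # backslash + CR or CRLF: continuation
            i += 1
            if i < n and body[i] == 0x0A:
                i += 1
        elif c == 0x0A:                    # backslash + LF: continuation
            i += 1
        elif c in simple:                  # \n, \r, \t, \b, \f, \(, \), \\
            out += simple[c]
            i += 1
        else:                              # unknown escape: drop the backslash
            out.append(c)
            i += 1
    return bytes(out)


print(decode_literal_string(b"ht\\\ntp"))
print(decode_literal_string(b"\\150\\164\\164\\160"))
```

Both calls decode to the same four bytes, which is the point: after decoding, one pattern covers all the spellings.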
One more way to represent a string is hexadecimal:
You’re allowed to put whitespace between the hex digits:
And you’re not limited in the amount of whitespace you use:
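Decoding a hexadecimal string is mostly a matter of throwing the whitespace away first. A minimal sketch follows, assuming the token includes its angle brackets; `decode_hex_string` is a name chosen for this example.

```python
# The six whitespace characters of the PDF syntax: NUL, TAB, LF, FF, CR, SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def decode_hex_string(token: bytes) -> bytes:
    """Decode a PDF hexadecimal string such as b'<68 74 74 70>'."""
    if not (token.startswith(b"<") and token.endswith(b">")):
        raise ValueError("not a hexadecimal string")
    digits = bytes(b for b in token[1:-1] if b not in PDF_WHITESPACE)
    if len(digits) % 2:
        digits += b"0"        # a missing final digit is taken to be 0
    return bytes.fromhex(digits.decode("ascii"))


print(decode_hex_string(b"<68747470>"))
print(decode_hex_string(b"<6 8   74\n74\t7 0>"))
```

Both calls yield the same bytes, however much whitespace is spread between the digits.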
This whitespace usage reminds me of the IE zero-byte trick in HTML.
I want to finish this long list of examples with PDF encryption, one more way to change the representation of a PDF document. PDFs can be encrypted without requiring the user to provide a password to view the document; this form of encryption is used for DRM. Ever had a PDF with printing or text copy disabled? That’s an encrypted PDF.
When a PDF is encrypted, only the strings and streams are encrypted; the objects themselves are not. Encrypted strings are thus one more way to change the representation of a string.
Here’s an example:
I know that PDF encryption has already been used to mislead spam filters.
The many features of the PDF language that provide flexibility in the representation of names and strings can also be used to generate polymorphic forms of the same malicious PDF. If you need to scan PDF documents, you need to be aware of all these features and have tools that support them.
There are indications that most AV products don’t canonicalize PDF documents prior to signature matching. I did some tests with a malicious mailto PDF document, and changing the string representation of the mailto URI action to the hexadecimal form allows AV detection evasion. Adding whitespace wasn’t necessary; switching to hex was enough. The ClamAV source code for PDF documents has more evidence of PDF canonicalization issues in AV software. Here is a string compare for the Length name without canonicalization:
This comparison will not match if the name is written with hex codes (#XX).
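The effect is easy to reproduce. The Python below is not ClamAV's code, only a small illustration of the same kind of comparison, with and without canonicalization; all function names are hypothetical.

```python
import re


def canonical(name: bytes) -> bytes:
    """Decode #XX sequences in a PDF name."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), name)


def is_length_naive(name: bytes) -> bool:
    """Direct byte comparison, no canonicalization."""
    return name == b"/Length"


def is_length_canonical(name: bytes) -> bool:
    """Canonicalize first, then compare."""
    return canonical(name) == b"/Length"


obfuscated = b"/L#65ngth"          # 0x65 is the code for e
print(is_length_naive(obfuscated))
print(is_length_canonical(obfuscated))
```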
I wonder if malicious PDF samples will be used in the Race to Zero.