Didier Stevens

Tuesday 29 April 2008

PDF, Let Me Count the Ways…

Filed under: Malware,PDF — Didier Stevens @ 6:21

In this post, I show how basic features of the PDF language can be used to generate polymorphic variants of (malicious) PDF documents. If you code a PDF parser, write signatures (AV, IDS, …) or analyze (malicious) PDF documents, you should be aware of these features.

Official language specifications are interesting documents; I used to read them from front to back. I especially appreciate the inclusion of a formal language description, for example in Backus–Naur form. But nowadays, I don’t take the time to do this anymore.

While browsing through the official PDF documentation, I took particular interest in the rules to express lexemes. There are many ways to write the same token, offering opportunities to evade known-pattern recognition systems, like AV and NIDS.

Building a test file

Before I show some examples, let’s build a test PDF file that will start the default browser and navigate to a site each time the document is opened.

Opening a web page from a PDF file can be done with a URI action, like this:

This is the same type of object used in the malicious mailto PDF files.

An action must be triggered by an event, examples of such triggers are the association of an action to the display of a page or the opening of the PDF document. We will use the OpenAction to trigger our URI action object each time our test PDF document is opened:
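As a rough sketch of what these two pieces look like in PDF syntax, the following Python helpers emit a URI action object and a document catalog that references it through OpenAction. The object numbers, the helper names and the example.com URL are placeholders chosen for illustration, not the values used in the test file.

```python
def uri_action_object(obj_num: int, url: str) -> bytes:
    """Return an indirect object holding a URI action for the given URL."""
    # Assumption: the URL contains no parentheses or backslashes, which would
    # need escaping inside a PDF literal string.
    return (
        f"{obj_num} 0 obj\n"
        "<<\n"
        " /Type /Action\n"
        " /S /URI\n"
        f" /URI ({url})\n"
        ">>\n"
        "endobj\n"
    ).encode("ascii")


def catalog_object(obj_num: int, pages_num: int, action_num: int) -> bytes:
    """Return a document catalog whose OpenAction points at the action object."""
    return (
        f"{obj_num} 0 obj\n"
        "<<\n"
        " /Type /Catalog\n"
        f" /Pages {pages_num} 0 R\n"
        f" /OpenAction {action_num} 0 R\n"
        ">>\n"
        "endobj\n"
    ).encode("ascii")


if __name__ == "__main__":
    # Hypothetical numbering: catalog is object 1, page tree 2, action 7.
    print(catalog_object(1, 2, 7).decode("ascii"))
    print(uri_action_object(7, "http://example.com").decode("ascii"))
```

The names /S, /URI and /OpenAction are the tokens that the rest of this post rewrites in different representations.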

I add the URI action object and the OpenAction event to the hello world PDF file I used in a previous post, to build our test PDF. You can download all examples here. Opening the test PDF document launches IE:

Now that we have our test PDF, let’s look at the ways we can change its representation without changing its rendering. This is what I’m covering (this list is not exhaustive):

  • Names
    • Hexadecimal encoding
  • Strings
    • Newline escaping
    • Octal encoding
    • Hexadecimal encoding
    • Hexadecimal whitespacing
    • Encryption

Name representation

The tokens preceded by a / (slash) in the URI action object are called Names in the official PDF description. Names are case-sensitive. The characters used in a Name are limited to a specific set, but since PDF specification version 1.2, a lexical convention has been added to represent a character with its hexadecimal ANSI-code, like this #XX.

This allows us to rewrite the /URI name in several ways, for example: /#55RI.

Or /#55#52#49

Pattern matching algorithms must take into account these different representations to successfully match a pattern. A standard way to deal with this is canonicalization. First, the token is reduced to a canonical form (e.g. replace all #xx representations by the character they stand for), and second, pattern matching is performed on the canonical form.
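A minimal sketch of this canonicalization step, together with a generator that enumerates the hexadecimal variants of a name, could look like this. The function names are made up for this example, and the code assumes it is handed a single name token rather than a whole file.

```python
import re
from itertools import product


def canonicalize_name(token: bytes) -> bytes:
    """Replace every #XX sequence in a name token by the byte it stands for."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        token,
    )


def name_variants(name: str):
    """Yield every mix of literal and #XX-encoded characters for a name.

    `name` is given without the leading slash; a name of n characters
    has 2**n such representations.
    """
    choices = [(ch, "#%02X" % ord(ch)) for ch in name]
    for combo in product(*choices):
        yield ("/" + "".join(combo)).encode("ascii")


if __name__ == "__main__":
    for variant in name_variants("URI"):
        # Every variant reduces to the same canonical form.
        print(variant, canonicalize_name(variant))
```

A three-letter name like /URI already has 8 representations, so matching on the raw bytes means carrying 8 patterns where one pattern on the canonical form would do.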

String representation

Strings too can be represented in many forms. One way to represent a string is to type the text between parentheses:

Splitting strings over several lines can be done by adding a backslash (\) at the end of each line:

Of course, we are not limited in the number of lines; we can add a backslash after each character:

A character in a string can be represented by its octal code, like this:

And this can be done for every character in the string:
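To compare strings written in these forms, a scanner has to undo the escapes first. The decoder below is a sketch under simplifying assumptions: it receives only the bytes between the outer parentheses, and it does not normalize unescaped end-of-line markers. The function name is invented for this example.

```python
def decode_literal_string(body: bytes) -> bytes:
    """Decode the bytes found between the outer parentheses of a PDF string."""
    simple = {
        ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
        ord("b"): 0x08, ord("f"): 0x0C,
        ord("("): ord("("), ord(")"): ord(")"), ord("\\"): ord("\\"),
    }
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != 0x5C:  # not a backslash: copy the byte as-is
            out.append(c)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break  # lone trailing backslash: nothing left to decode
        c = body[i]
        if c in simple:
            out.append(simple[c])
            i += 1
        elif c in b"01234567":
            # Octal escape: one to three octal digits.
            j = i
            while j < len(body) and j < i + 3 and body[j] in b"01234567":
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif c in b"\r\n":
            # Backslash at end of line: the line break is not part of the string.
            i += 1
            if c == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:
            # Unknown escape: the backslash is dropped, the character is kept.
            out.append(c)
            i += 1
    return bytes(out)
```

With this in place, `http://\` followed by a line break and `example.com` decodes to the same bytes as the plain one-line string, and so does a string where every character is written as an octal code.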

One more way to represent a string is hexadecimal:

You’re allowed to put whitespace between the hex digits:

And you’re not limited in the amount of whitespace you use:
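Reducing a hexadecimal string to its bytes is a matter of dropping the whitespace and converting digit pairs. The sketch below uses an invented function name and assumes the token is well-formed, containing only hex digits and whitespace between the angle brackets.

```python
# The six whitespace characters of the PDF syntax: NUL, tab, LF, FF, CR, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def decode_hex_string(token: bytes) -> bytes:
    """Decode a PDF hexadecimal string such as <68 74 74 70>."""
    body = token.strip()
    if body.startswith(b"<") and body.endswith(b">"):
        body = body[1:-1]
    # Whitespace between hex digits carries no meaning, so drop all of it.
    digits = bytes(b for b in body if b not in PDF_WHITESPACE)
    # An odd number of digits means the last digit is followed by an implied 0.
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


if __name__ == "__main__":
    print(decode_hex_string(b"<68747470>"))
    print(decode_hex_string(b"<6 8   7 4\n7 4\t7 0>"))
```

Both calls decode to the same four bytes, however much whitespace is sprinkled between the digits.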

This whitespace usage reminds me of the IE zero-byte trick in HTML.

I want to finish this long list of examples with PDF encryption, which is one more way to change the representation of a PDF document. PDFs can be encrypted without requiring the user to provide a password to view the encrypted document; this form of encryption is used for DRM. Ever had a PDF with printing or text copy disabled? That’s an encrypted PDF.

When a PDF is encrypted, only the strings and streams are encrypted; the objects themselves are not. Encrypted strings are thus one more way to change the representation of a string.

Here’s an example:

I know that PDF encryption has already been used to mislead spam filters.

Final thoughts

These features of the PDF language, which provide flexibility in the representation of names and strings, can also be used to generate polymorphic forms of the same malicious PDF. If you need to scan PDF documents, you need to be aware of all these features and have tools that support them.

There are indications that most AV products don’t canonicalize PDF documents prior to signature matching. I did some tests with a malicious mailto PDF document, and changing the string representation of the mailto URI action to the hexadecimal form was enough to evade AV detection. Adding whitespace wasn’t necessary; switching to hex was enough. The ClamAV source code for PDF documents has more evidence of PDF canonicalization issues in AV software; here is a string compare for the Length name without canonicalization:

This will not match if hex codes are used (#).
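The effect is easy to reproduce. The snippet below is an illustration in Python, not the ClamAV code: it compares a plain substring search with a search on canonicalized data, using a made-up dictionary in which one letter of /Length is written as a hex code. Applying the substitution to a whole buffer is a shortcut for the demo; a real scanner would canonicalize name tokens only, since a # inside a string or stream has no special meaning.

```python
import re


def canonicalize_names(data: bytes) -> bytes:
    """Crudely expand #XX sequences everywhere in the buffer (demo only)."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def naive_match(data: bytes, pattern: bytes) -> bool:
    """Signature match on the raw bytes, without canonicalization."""
    return pattern in data


def canonical_match(data: bytes, pattern: bytes) -> bool:
    """Signature match after reducing names to their canonical form."""
    return pattern in canonicalize_names(data)


if __name__ == "__main__":
    sample = b"<< /L#65ngth 42 >>"  # hypothetical dictionary, 0x65 is 'e'
    print(naive_match(sample, b"/Length"))
    print(canonical_match(sample, b"/Length"))
```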

I tested all my examples with Adobe Acrobat Reader 8.1.2 and Foxit Reader 2.2 without problems. But Foxit Reader 2.2 gave me an unpleasant surprise; more on this in an upcoming post.

I wonder if malicious PDF samples will be used in the Race to Zero.

Monday 21 April 2008

“Only X Out of 32 Antivirus Products Detect This!”

Filed under: Malware — Didier Stevens @ 6:47

Ever seen a title like this before? Do you know what it means? It usually means that the author didn’t actually test the malware sample on 32 Windows machines, each protected by a different AV product, but that he uploaded the sample to the free VirusTotal service and received a report.

Testing the detection of a malware sample with 32 AV products and submitting the sample to the VirusTotal service are two different things. Assuming that these tests are equivalent, and implicitly supposing that the results are the same, is plain wrong.

I read enough presentations and articles talking about “tested with 32 AV products” without even mentioning VirusTotal. And that is at least misleading, if not more. To me, “32 AV products” strongly suggests “tested with VirusTotal”, and not “we really tested 32 AV products”.

Julio Canto from VirusTotal was kind enough to answer a couple of questions I had about the free service they are providing.

First of all, VirusTotal uses command-line AV scanners that require no installation; this way they can run 32 different AV products on the same Windows box. These AV scanners run in sequential order when a file is submitted. An active AV product and a command-line AV product are 2 different things, with different goals, fulfilling different needs. Take McAfee for example. McAfee VirusScan Enterprise has a feature called ScriptScan that will intercept and scan each VBScript and JavaScript before it is executed by the Microsoft script engine. The command-line version of McAfee doesn’t have this feature. So if you let VirusTotal scan a heavily obfuscated script, it’s likely that the McAfee command-line scanner used by VirusTotal will not detect it. But it’s likely that McAfee VirusScan will detect it with ScriptScan, before it gets executed.

It’s the AV vendor that decides which version of its product will be used by VirusTotal and how it has to be configured. Some vendors will even provide beta versions of their product for the VirusTotal team to use. VirusTotal has an NDA with most vendors; that’s why they don’t provide the configuration details for each AV engine. Some vendors are conservative in their settings, while others will use all options (like heuristics).

VirusTotal does not execute submitted files in a sandbox; they are just scanned by the AV engines.

If you get fewer than 32 results in your report, it means that an AV engine timed out (it didn’t respond in the allotted time, and the process was killed) and didn’t provide a detection report. The VirusTotal service uses a cluster of 16 machines.

Although the VirusTotal service generates a lot of data that contains a wealth of statistics, they don’t usually look for trends. The company behind VirusTotal (Hispasec), is not involved in the AV world at all, but can use some of the statistics for consulting services.

VirusTotal implemented an anti-abuse system: if one source submits too many samples in too short a time period, subsequent requests will be refused. This is done to give all users equal access to the service.

To finish, Julio gave me some links to similar services:

And remember, when you’re using the VirusTotal service, you’re testing your submitted sample, you’re not testing the AV products. At most, you could say you’re testing bare AV engines with a configuration that is unknown to you.

Saturday 19 April 2008

Taking the GSSP-C Exam

Filed under: Announcement,Certification — Didier Stevens @ 11:10

I’ve a blogpost over at the PaulDotCom Community Blog about my GSSP-C certification.

Wednesday 16 April 2008

Quickpost: Linux Kernel Joke

Filed under: Nonsense,Quickpost — Didier Stevens @ 9:29

A colleague challenged me, half jokingly, to perform a code review of the Linux kernel. I took his challenge: I downloaded the latest stable kernel sources and used a state of the art static code checker (grep -hEir "hack|crack|backdoor|keygen" *).

I located a couple of backdoors:

Some cracks:

And even some keygens:

And the number of hacks was countless (1000+), here is a selection:

Quickpost info

Tuesday 15 April 2008

Update: Disitool V0.2

Filed under: My Software — Didier Stevens @ 8:25

Ero Carrera’s latest version of pefile has extra methods to handle the checksum of the PE header. My new disitool version uses these methods to correct the checksum when the signature is changed by disitool.

Wednesday 9 April 2008

Quickpost: About the Physical and Logical Structure of PDF Files

Filed under: PDF,Quickpost — Didier Stevens @ 6:57

Here is a post to explain in detail PDF polymorphism mentioned in my BH post.

This is a simple “Hello World”-PDF viewed with a text editor:

It is composed of:

  • a header
  • a list of objects
  • a cross reference table
  • a trailer

What I describe here is the physical structure of a PDF file. The header identifies that this is a PDF file (specifying the PDF file format version), the trailer points to the cross reference table (starting at byte position 642 into the file), and the cross reference table points to each object (1 to 7) in the file (byte positions 12 through 518). The objects are ordered in the file: 1, 2, 3, 4, 5, 6 and 7.

The logical structure of a PDF file is a hierarchical structure; the root object is identified in the trailer. Object 1 is the root, objects 2 and 3 are children of object 1, etc…, giving this logical structure:

The physical structure of a PDF file can be transformed into another physical structure, without changing the logical structure. Here is the same file, but now the objects are ordered from 7 to 1 (I reversed the order in which the objects appear in the file):

I also had to update the cross reference table, because each object is located at a different position now. But apart from that, nothing has changed. The root is still object 1, and the tree is the same. In other words, the logical structure of the file remained unchanged, which implies that the rendering of both PDF files is identical. Objects can appear at random positions in a PDF file without impact on the logical file structure (i.e. rendering). For this simple file, with 7 objects, I have 5040 (that’s 7!) possible physical structures, just by reordering the objects. And reordering objects is just one way to mutate the physical structure of a PDF file.
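The reordering can be sketched in a few lines of Python. The builder below writes the objects in any requested order and recomputes the cross reference table and the startxref position accordingly. The three-object document is a stand-in invented for this example (it is not the 7-object Hello World file), and the code assumes objects are numbered contiguously from 1.

```python
import math

# Hypothetical minimal document: catalog, page tree and one empty page.
OBJECTS = {
    1: b"<< /Type /Catalog /Pages 2 0 R >>",
    2: b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    3: b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
}


def build_pdf(objects: dict, order: list, root: int) -> bytes:
    """Serialize the objects in the given physical order with a matching xref."""
    out = bytearray(b"%PDF-1.1\n")
    offsets = {}
    for num in order:
        offsets[num] = len(out)  # byte position where this object starts
        out += b"%d 0 obj\n" % num + objects[num] + b"\nendobj\n"
    xref_pos = len(out)
    size = max(objects) + 1
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for num in range(1, size):
        # The table is sorted by object number, whatever the physical order.
        out += b"%010d 00000 n \n" % offsets[num]
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    forward = build_pdf(OBJECTS, [1, 2, 3], root=1)
    reverse = build_pdf(OBJECTS, [3, 2, 1], root=1)
    print(forward != reverse)   # different physical structure
    print(math.factorial(7))    # number of orderings for 7 objects
```

Both outputs describe the same logical tree; only the byte positions recorded in the cross reference table differ.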

You can download both PDF files here.



Tuesday 8 April 2008

Quickpost: Back from Black Hat Europe 2008

Filed under: Hacking,Quickpost — Didier Stevens @ 7:44

Back from Black Hat Europe 2008, my laptop has undergone another lobotomy.

Mikko from F-Secure was in my training class.

Some briefings I really liked:

  • New Viral Threats of PDF Language
    Good overview of the format of PDF files, and the inherent security issues. Good demos (like rewriting the Acrobat Reader alert dialog box to mislead the user) and interesting insights (a PDF has a logical and physical structure, changing the physical structure doesn’t change the content of the document: this is polymorphism). The speaker confirmed that his exploits don’t affect Foxit Reader. But the slides don’t do this justice, let’s hope they publish more details. And it was fun to see some French military lingo popping up in a BH presentation.
  • Intercepting Mobile Phone/GSM Traffic
    THC explained how they cracked GSM A5/1 encryption, FPGA style and with 2 TB of rainbow tables. Interesting tidbits: mobile operators don’t provide the strongest available encryption A5/3 (my guess as to why: cost), and the GSM status channel will carry permanent subscriber IDs, although the protocol only foresees temporary IDs.
  • Mobile Phone Spying Tools
    Tools mainly used by untrusting spouses, but I see potential uses for industrial espionage: a salesman leaves the company for the competition and installs a mobile phone spying tool on his corporate mobile phone just before handing it back.
  • DTRACE: The Reverse Engineer’s Unexpected Swiss Army Knife
    Looks really powerful and flexible, let’s hope someone is brave enough to attempt a Windows port.

And the networking was great, shout-out to Malta Info Security.


Carnival of the Security Catalyst Community 2008/04/08

Filed under: Fellow Bloggers — Didier Stevens @ 7:38

The Security Catalyst Community is a free forum for IT security professionals, it’s one of the few communities where I’m an active member. One of the things I like about the SCC is that a lot of the discussions are non-technical. Let me illustrate this by highlighting some message threads (you’ll need to create an account if you want to read these):

There are no trolls in the SCC, it’s low-volume, and sometimes, someone comes with a technical puzzle that will get my eager attention. And you’ll get the opportunity to discuss with security authors, bloggers and podcasters like Rebecca, Martin and Harlan.
