For the sake of this post, I consider a PDF document malformed when it doesn’t observe the basic structure of a PDF document.
I’ve seen a couple of malicious, malformed PDF documents. The most recent was a malicious swine flu PDF document that contains another, benign PDF document with information about the swine flu (obtained from the CDC site). This second PDF document is displayed to mislead the user while the exploit runs.
This second PDF document is XOR-encoded and appended to the end of the malicious PDF document, which makes the malicious PDF document malformed (FYI: the PDF file format supports embedded files, but this wasn’t used here). A PDF reader like Adobe Reader or Foxit Reader has no problem opening this malformed PDF, because it scans the document for the trailer’s end-of-file marker (%%EOF) starting from the end of the document. Everything that follows this marker and doesn’t adhere to the PDF syntax is simply ignored.
I’ve added some new features to my PDF tools to handle malformed PDF documents.
For a normal PDF file, expect the total entropy and the entropy of bytes inside stream objects to be close to the maximum value of 8.0 (bits per byte). This means that the distribution of byte values is close to random, which is characteristic of compressed and encrypted data.
Outside stream objects, the data appears much less random and the entropy is much lower, usually around 4.0 or 5.0.
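To make the entropy figures above concrete, here is a minimal sketch of how such a report could be computed. It is not PDFiD’s actual code: the function names (shannon_entropy, entropy_report) are hypothetical, and the regular expression that locates streams is a naive assumption that ignores the /Length entry and will misjudge unusual or deliberately obfuscated files.

```python
import math
import re
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Naive assumption: a stream body is whatever sits between the "stream"
# keyword (followed by an end-of-line) and the next "endstream" keyword.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def entropy_report(pdf_bytes: bytes) -> dict:
    """Split the file into bytes inside and outside streams and measure each."""
    inside = bytearray()
    outside = bytearray()
    pos = 0
    for match in STREAM_RE.finditer(pdf_bytes):
        outside += pdf_bytes[pos:match.start(1)]  # everything up to the body
        inside += match.group(1)                  # the stream body itself
        pos = match.end(1)
    outside += pdf_bytes[pos:]                    # remainder after last stream
    return {
        "total": shannon_entropy(pdf_bytes),
        "inside": shannon_entropy(bytes(inside)),
        "outside": shannon_entropy(bytes(outside)),
    }
```

Calling entropy_report(open("malformed.pdf", "rb").read()) on a file with random data appended after the last object should show a noticeably higher "outside" value than the same call on the unmodified file.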
However, for malformed PDF documents, where data is added without using stream objects, the entropy outside stream objects is much higher. Here is the report for the malicious swine flu PDF:
Another datum added to the report by the --extra option concerns the end-of-file marker %%EOF.
The “%%EOF” line gives the number of times %%EOF appears in the document (more than once usually indicates incremental updates). “After last %%EOF” counts the number of bytes after the last %%EOF. This value will not be zero when data has been appended.
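The two %%EOF values can be approximated with a few lines of Python. This is a sketch under stated assumptions, not PDFiD’s implementation: the name eof_report is hypothetical, and the choice to ignore one end-of-line sequence directly after the marker is mine, so the byte count may differ slightly from what PDFiD reports.

```python
from typing import Optional, Tuple

EOF_MARKER = b"%%EOF"


def eof_report(pdf_bytes: bytes) -> Tuple[int, Optional[int]]:
    """Return (number of %%EOF markers, bytes after the last one).

    The second value is None when the file contains no %%EOF at all.
    """
    count = pdf_bytes.count(EOF_MARKER)
    if count == 0:
        return 0, None
    tail = pdf_bytes[pdf_bytes.rfind(EOF_MARKER) + len(EOF_MARKER):]
    # Assumption: a single end-of-line right after the marker is normal
    # and should not be counted as appended data.
    for eol in (b"\r\n", b"\n", b"\r"):
        if tail.startswith(eol):
            tail = tail[len(eol):]
            break
    return count, len(tail)
```

For a file that ends with %%EOF followed by 4096 appended bytes, eof_report returns a count of 1 and a trailing length of 4096.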
Earlier versions of pdf-parser output a lot of “todo 10” data (an indication of malformed PDF data) when parsing a malformed PDF document. I’ve suppressed this behavior; from now on you’ll need the --verbose option to enable it, should you need it. Since I check a PDF document with PDFiD before using pdf-parser, I no longer consider the “todo” output relevant: PDFiD’s entropy and %%EOF report will tell me whether a PDF document is malformed.
But the other new option in pdf-parser, --extract, is more important. Example:
pdf-parser.py --extract payload.bin malformed.pdf
This option will extract all malformed data from malformed.pdf and write it to file payload.bin, giving you easy access to the embedded payload.
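Once the payload is in a file, the next step is often to decode it. The swine flu sample stored its decoy PDF XOR-encoded, so a quick check is to look for a key that reveals a PDF header. The sketch below assumes a single-byte XOR key, which the sample may or may not have used; find_xor_key and xor_decode are hypothetical helper names and are not part of pdf-parser.

```python
from typing import Optional, Tuple


def xor_decode(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)


def find_xor_key(payload: bytes, magic: bytes = b"%PDF-") -> Optional[Tuple[int, int]]:
    """Return (key, offset) for the first key that reveals magic, else None.

    Instead of decoding the whole payload 256 times, encode the short
    magic string with each candidate key and search for that.
    """
    for key in range(256):
        offset = payload.find(xor_decode(magic, key))
        if offset != -1:
            return key, offset
    return None
```

With the extracted file read into a variable named payload, find_xor_key(payload) gives a candidate key and the offset of the hidden header, and xor_decode(payload[offset:], key) recovers the embedded document if the single-byte assumption holds. A key of 0 means the data was not encoded at all.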
You can download a normal and a malformed Hello World PDF file here to familiarize yourself with my updated tools. The malformed file has 4096 random bytes appended to the end of the PDF document.
Here is a last example where the entropy calculation can be handy even if the payload is stored inside a stream object:
The total entropy and the entropy of bytes inside stream objects are very low here because this malicious PDF document has a payload with a very long, uncompressed NOP-sled (more than one million times 0x90).
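When low entropy hints at this kind of padding, a rough follow-up check is to look for the longest run of a single repeated byte. This is only an illustrative sketch; longest_run is a hypothetical helper and not a feature of PDFiD or pdf-parser.

```python
from itertools import groupby
from typing import Tuple


def longest_run(data: bytes) -> Tuple[int, int]:
    """Return (byte_value, run_length) for the longest run of one byte.

    Returns (-1, 0) for empty input.
    """
    best_value, best_length = -1, 0
    for value, group in groupby(data):
        length = sum(1 for _ in group)
        if length > best_length:
            best_value, best_length = value, length
    return best_value, best_length
```

A result with byte value 0x90 and a run length in the hundreds of thousands or more would be consistent with an uncompressed NOP-sled like the one described above.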