In this update, you can also save your library of custom regular expressions in the working directory (prior versions only loaded it from the application directory).
Here is an example with a regular expression for MAC addresses:
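As a rough Python sketch of what such an expression could look like: the pattern and the names MAC_RE and find_macs are assumptions made for illustration, not the entry from the tool's library. It accepts six pairs of hex digits separated by either colons or hyphens, and uses a backreference so the separator has to stay the same throughout.

```python
import re

# Hypothetical pattern: six pairs of hex digits separated by ':' or '-'.
# The backreference \1 forces the same separator between all pairs.
MAC_RE = re.compile(
    r'\b[0-9A-Fa-f]{2}([:-])(?:[0-9A-Fa-f]{2}\1){4}[0-9A-Fa-f]{2}\b'
)


def find_macs(text):
    """Return all MAC-address-like substrings found in text."""
    return [match.group(0) for match in MAC_RE.finditer(text)]
```

Calling find_macs on a line of log output returns the matching strings in the order they appear.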
And there’s a small fix for the URL regex: a - character (hyphen) was not considered to be part of the query of a URL.
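To illustrate the kind of issue involved, here is a deliberately simplified URL pattern in Python. It is an assumption for illustration only (URL_RE and find_urls are hypothetical names, and the real expression in the tool is not shown here). The point is that the hyphen has to be listed in the character class used for the query, and placing it last keeps it from being read as a range.

```python
import re

# Simplified, hypothetical URL pattern. The hyphen is placed last in each
# character class so it is taken literally and not as a range operator.
URL_RE = re.compile(
    r'https?://[A-Za-z0-9.-]+'          # scheme and host
    r'(?:/[A-Za-z0-9._~/%+-]*)?'        # optional path
    r'(?:\?[A-Za-z0-9._~/%+&=-]*)?'     # optional query, hyphen allowed
)


def find_urls(text):
    """Return all substrings of text that match the simplified URL pattern."""
    return URL_RE.findall(text)
```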
There is no /URI reported, but note that the PDF contains 5 stream objects (/ObjStm). These can contain /URIs. In the past, I would search and decompress these stream objects with pdf-parser.py, and then pipe the result through pdfid.py, in order to detect /URIs (or other objects that require further analysis).
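The idea behind that older workflow (decompress first, then look for names) can be sketched in a few lines of Python. This is not how pdfid.py or pdf-parser.py are implemented: the function name count_keywords and the keyword list are assumptions, the stream is assumed to use plain FlateDecode, and name obfuscation is ignored.

```python
import re
import zlib

# Illustrative subset of names worth a closer look (an assumption, not
# the list used by pdfid.py).
KEYWORDS = [b'/URI', b'/JS', b'/JavaScript', b'/OpenAction']


def count_keywords(compressed_stream):
    """Inflate a FlateDecode stream and count PDF names of interest."""
    data = zlib.decompress(compressed_stream)
    counts = {}
    for keyword in KEYWORDS:
        # Require that the name is not followed by a letter or digit,
        # so that /URI does not also match a longer name such as /URIs.
        pattern = re.escape(keyword) + rb'(?![A-Za-z0-9])'
        counts[keyword.decode()] = len(re.findall(pattern, data))
    return counts
```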
Since pdf-parser.py version 0.7.0, I prefer another method: using option -O to let pdf-parser.py extract and parse the objects inside stream objects.
With option -a (here combined with option -O), I can get statistics and keywords just like with pdfid:
Now I can see that there is a /URI inside the PDF (object 43).
Thus I can use option -k to get the value of /URI entries, combined with option -O to look inside stream objects:
And here I have the /URI.
Another method is to select object 43:
From this output, we also see that object 43 is inside stream object 16.
Remark: if you use option -O on a PDF that does not contain stream objects (/ObjStm), pdf-parser will behave as if you didn’t provide this option. Hence, if you want, you can always use option -O to analyze PDFs.
This new version of pdf-parser brings support for analysis of stream objects (/ObjStm). Use new option -O to enable this mode.
Stream objects (/ObjStm) are objects whose stream contains other objects. These contained objects cannot have a stream themselves.
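For readers who want to see the layout, here is a minimal Python sketch of how the decoded data of an /ObjStm can be split into its contained objects. It follows the PDF specification: the dictionary entry /N gives the number of contained objects, /First gives the offset of the first one, and the decoded stream starts with N pairs of integers (object number, then offset relative to /First). The function name split_object_stream is an assumption, and this is not the code used by pdf-parser.py.

```python
def split_object_stream(decoded, n, first):
    """Split the decoded data of an /ObjStm into {object number: raw bytes}.

    decoded: stream bytes after its filters (e.g. FlateDecode) were applied
    n:       value of /N, the number of contained objects
    first:   value of /First, the offset of the first contained object
    """
    # The header holds n pairs: object number, offset relative to /First.
    header = decoded[:first].split()
    numbers = [int(value) for value in header[0:2 * n:2]]
    offsets = [int(value) for value in header[1:2 * n:2]]

    objects = {}
    for index, (number, offset) in enumerate(zip(numbers, offsets)):
        start = first + offset
        # An object runs up to the start of the next one, or to the end.
        if index + 1 < len(offsets):
            end = first + offsets[index + 1]
        else:
            end = len(decoded)
        objects[number] = decoded[start:end].strip()
    return objects
```

The contained objects appear without the obj and endobj keywords, which is why the offsets in the header are needed to find their boundaries.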
pdfid.py detects the presence of stream objects:
But pdfid cannot look inside a stream to figure out what objects are inside. That’s why I always say to use pdf-parser to select and decompress stream objects, and then pipe this through pdfid:
When pdf-parser parses a stream object, it does not parse the content of its stream:
This changes with this new version of pdf-parser. When option -O is used, pdf-parser extracts objects from /ObjStm streams and handles them like normal objects. In the following example, object 2 is contained in object 1:
pdf-parser provides statistics for a PDF’s content with option -a:
Combining option -a with option -O includes objects present inside stream objects (this is an alternative to combining both tools: pdf-parser -s objstm -f a.pdf | pdfid -f):
This output shows that /JavaScript can be found in object 7. We need to use option -O to find object 7 “hiding” in object 1:
If we forget to use option -O, object 7 is not found:
I added function ZlibRawD to translate.py to decompress Zlib compression without header (ZlibD already exists, and is for Zlib compression with header).
This compression is sometimes used in malicious PowerShell scripts:
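The difference between the two variants is easy to show with Python's zlib module. The sketch below is not the code of translate.py; the function names inflate_raw and decode_powershell_payload are assumptions, as is the premise that the script carries its payload as Base64 text wrapping a raw DEFLATE stream.

```python
import base64
import zlib


def inflate_raw(data):
    """Decompress a raw DEFLATE stream (no zlib header, no checksum)."""
    # A negative wbits value tells zlib not to expect a header.
    return zlib.decompress(data, -zlib.MAX_WBITS)


def decode_powershell_payload(b64_text):
    """Base64-decode a payload string, then inflate it as raw DEFLATE."""
    return inflate_raw(base64.b64decode(b64_text))
```

Calling zlib.decompress without the negative wbits argument expects the Zlib header, which is the case the existing ZlibD function covers.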
I show how to use this option in a malicious document analysis video below. If you want to jump straight to the point where I use option -C with a UNICODE string, go to 9:16.
This is a bug fix update: for agile encryption, Python module msoffcrypto does not throw an exception in method load_key when an invalid password is provided. It throws an exception when an attempt is made to decrypt the file.
I added a call to method decrypt to handle this case.
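The pattern looks roughly like the Python sketch below. It is an illustration and not the actual code of the tool: password_is_valid is a hypothetical helper, and office_file is assumed to be an object with the interface of msoffcrypto.OfficeFile, that is, load_key(password=...) followed by decrypt(output_stream).

```python
import io


def password_is_valid(office_file, password):
    """Return True if password decrypts office_file, False otherwise.

    office_file is assumed to behave like msoffcrypto.OfficeFile.
    """
    try:
        office_file.load_key(password=password)
        # With agile encryption, load_key may succeed with a wrong
        # password, so attempt a real decryption into a throwaway buffer.
        office_file.decrypt(io.BytesIO())
    except Exception:
        return False
    return True
```

With the real module, office_file would be created from an open file handle, for example msoffcrypto.OfficeFile(open("encrypted.docx", "rb")), where the filename is a placeholder.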