Here’s a heads-up about some malicious PDF samples that are deliberately malformed to avoid detection.
The most important case is the missing endobj keyword:
Adobe Reader will happily parse a PDF where the objects are not terminated with endobj, but my pdf-parser won’t. I’ll have to update the parser to deal with this case.
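One way a parser could tolerate this is to treat the start of the next object as an implicit end of the current one. The following is only a minimal sketch of that idea, not the actual pdf-parser code. The names `OBJ_HEADER` and `split_objects` are made up for illustration, and the sketch assumes object headers of the form `12 0 obj` never appear inside stream data.

```python
import re

# Matches an object header such as "12 0 obj" (object number, generation, keyword).
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def split_objects(data: bytes):
    """Return (number, generation, body) tuples, tolerating a missing endobj."""
    headers = list(OBJ_HEADER.finditer(data))
    objects = []
    for i, header in enumerate(headers):
        body_start = header.end()
        # Hard limit: the next object header, or the end of the data for the last object.
        limit = headers[i + 1].start() if i + 1 < len(headers) else len(data)
        # Preferred terminator: an endobj keyword before that limit.
        end = data.find(b"endobj", body_start, limit)
        if end == -1:
            # No endobj found: fall back to the hard limit.
            end = limit
        body = data[body_start:end].strip()
        objects.append((int(header.group(1)), int(header.group(2)), body))
    return objects


sample = b"1 0 obj\n<< /Type /Catalog >>\n2 0 obj\n<< /Type /Pages >>\nendobj\n"
for number, generation, body in split_objects(sample):
    print(number, generation, body)
```

In the sample, object 1 has no endobj, so its body simply ends where object 2 begins.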
The cross-reference table can also be omitted:
This is not an issue for my parser.
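A parser that scans the file sequentially does not need the cross-reference table at all, because it can rebuild the object offsets itself. Here is a minimal sketch of such a scan. It is an illustration under that assumption, not the pdf-parser implementation, and `build_offset_index` is a hypothetical helper name.

```python
import re

# Matches an object header such as "12 0 obj"; references like "12 0 R" do not match.
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def build_offset_index(data: bytes) -> dict:
    """Map (object number, generation) to a byte offset by scanning the raw data."""
    index = {}
    for match in OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        # A later definition of the same object overwrites the earlier one.
        index[key] = match.start()
    return index


# A tiny document with no xref section at all.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\n%%EOF\n"
)
print(build_offset_index(sample))
```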
I also received a sample with a stream object where the case of the endstream keyword was wrong: Endstream. At first we assumed Adobe Reader was not case-sensitive for the endstream keyword, but I found out it can actually parse a stream object with the endstream keyword missing altogether:
This is an issue for my parser.
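A tolerant parser could handle this by falling back to another boundary when endstream is absent. The sketch below tries endstream first, then endobj, then the end of the data. It is only an illustration: `extract_stream` is a hypothetical helper, it ignores the /Length entry entirely, and it assumes the first `stream` keyword followed by an end-of-line really starts the stream.

```python
import re

# The stream keyword must be followed by CRLF or LF before the data begins.
STREAM_START = re.compile(rb"stream\r?\n")


def extract_stream(obj_body: bytes):
    """Return the raw stream bytes of an object body, or None if it has no stream."""
    match = STREAM_START.search(obj_body)
    if match is None:
        return None
    start = match.end()
    # Try the proper terminator first, then fall back to endobj.
    for terminator in (b"endstream", b"endobj"):
        end = obj_body.find(terminator, start)
        if end != -1:
            break
    else:
        # Neither keyword is present: take everything up to the end of the data.
        end = len(obj_body)
    data = obj_body[start:end]
    # Drop the single end-of-line marker that normally precedes the terminator.
    for eol in (b"\r\n", b"\n", b"\r"):
        if data.endswith(eol):
            return data[: -len(eol)]
    return data


print(extract_stream(b"<< /Length 5 >>\nstream\nhello\nendstream\nendobj"))
print(extract_stream(b"<< /Length 5 >>\nstream\nhello\nendobj"))
```

Because the search is case-sensitive, a wrongly cased Endstream is treated as missing and ends up inside the returned data. Trimming the data to the /Length value, when that value is a direct integer, would be one way to refine this.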