I provide a two-day Hacking PDF training at HITB Amsterdam. This is one of the methods I teach.
Sometimes when I analyze PDF documents (benign or malicious), I want to reduce the PDF to its essential objects. But when you remove objects from a PDF, the indexes need to be updated, and references to the removed objects need to be updated or removed. To automate this process as much as possible, I updated my pdf-parser program to generate a Python program that, in turn, generates the original PDF.
Thus, when I want to make changes to the PDF (like removing objects), I generate its corresponding Python program, and then I edit this Python program.
I do this simply with option -g.
When you run the edited Python program, it generates a new PDF file.
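To illustrate why regenerating the PDF from a program is easier than editing the file by hand, here is a minimal sketch of writing a PDF with its index (the cross-reference table) recomputed on every run. This is not the code pdf-parser generates, nor the mPDF API: build_pdf is a hypothetical helper, it assumes every object has generation 0, and it marks missing object numbers as free entries in a simplified way.

```python
def build_pdf(objects, root_num):
    """Assemble a PDF from {object number: body bytes} (hypothetical helper).

    Byte offsets are recorded while writing, so the xref table is always
    consistent with the objects that are actually present.
    """
    out = bytearray(b"%PDF-1.1\n")
    offsets = {}
    for num in sorted(objects):
        offsets[num] = len(out)  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % num + objects[num] + b"\nendobj\n"

    xref_pos = len(out)
    size = max(objects) + 1
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"  # entry for object 0 is always free
    for num in range(1, size):
        if num in offsets:
            out += b"%010d 00000 n \n" % offsets[num]
        else:
            # Simplification: removed objects become plain free entries.
            out += b"0000000000 65535 f \n"
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root_num)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


objects = {
    1: b"<< /Type /Catalog /Pages 2 0 R >>",
    2: b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    3: b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
}
pdf = build_pdf(objects, root_num=1)

# Removing object 3: delete it, fix the reference in object 2, and rebuild.
del objects[3]
objects[2] = b"<< /Type /Pages /Kids [] /Count 0 >>"
reduced_pdf = build_pdf(objects, root_num=1)
```

The only manual work left is editing the object list and the references; the offsets take care of themselves.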
You can also use option -g together with option -f to pass the streams through their filters before they are inserted in the Python program. This gives you the decompressed streams in the Python program, opening them up to editing.
In this example, without option -f the Python statement for the stream object is:
oPDF.stream(5, 0, 'x\x9cs\nQ\xd0w3T02Q\x08IS040P0\x07\xe2\x90\x14\x05\r\x8f\xd4\x9c\x9c|\x85\xf0\xfc\xa2\x9c\x14M\x85\x90,\x05\xd7\x10\x00\xdfn\x0b!', '<<\r\n /Length %d\r\n /Filter /FlateDecode\r\n>>')
And with option -f, it becomes:
oPDF.stream2(5, 0, 'BT /F1 24 Tf 100 700 Td (Hello World) Tj ET', '', 'f')
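What the /FlateDecode filter does to a stream can be sketched with Python's zlib module. This is only an illustration of the round trip (compress, decompress, edit, recompress), not what pdf-parser or mPDF do internally; the variable names and the replacement text are my own assumptions.

```python
import zlib

# The readable content stream from the example above.
content = b"BT /F1 24 Tf 100 700 Td (Hello World) Tj ET"

# /FlateDecode streams are stored zlib-compressed.
compressed = zlib.compress(content)

# Decompressing gives back the readable page content operators.
restored = zlib.decompress(compressed)

# Once decompressed, the stream is plain text and easy to edit...
edited = restored.replace(b"Hello World", b"Hello PDF")

# ...and can be compressed again before it goes back into the PDF.
recompressed = zlib.compress(edited)
```

Working on the decompressed form is what makes the stream editable; the binary string in the first statement cannot reasonably be changed by hand.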
The generated Python program relies on my mPDF library, which is part of my PDF make tools.