Didier Stevens

Monday 19 May 2008

PDF Stream Objects

Filed under: Malware,PDF — Didier Stevens @ 6:09

A PDF stream object is a sequence of bytes. There is a virtually unlimited number of ways to represent the same byte sequence. After Names and Strings obfuscation, let’s take a look at streams.

A PDF stream object is composed of a dictionary (<< >>), the keyword stream, a sequence of bytes and the keyword endstream. All streams must be indirect objects. Here is an example:

This stream is indirect object 5, version 0 (the generation number). The stream dictionary must have a /Length entry that records the length of the (encoded) byte sequence. The stream and endstream keywords are terminated with the EOL character(s). In this example, the byte sequence is a set of instructions for the PDF reader to render the string Hello World with a given font at a precise position. It’s precisely 42 bytes long.
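A minimal Python sketch of how such an object could be assembled. The helper name build_stream_object and the content bytes are assumptions made for illustration (the bytes are simply one 42-byte sequence that draws Hello World); only the object layout and the /Length rule come from the description above.

```python
def build_stream_object(obj_num: int, data: bytes, generation: int = 0) -> bytes:
    """Serialize `data` as an indirect stream object (hypothetical helper)."""
    # /Length counts the bytes between the stream and endstream keywords,
    # not including the EOL markers that delimit them.
    dictionary = b"<< /Length %d >>" % len(data)
    return b"%d %d obj\n%s\nstream\n%s\nendstream\nendobj\n" % (
        obj_num, generation, dictionary, data)

# Illustrative content stream: select a font, move to a position, show text.
content = b"BT /F1 24 Tf 100 700 Td (Hello World)Tj ET"

obj = build_stream_object(5, content)
print(obj.decode("ascii"))
```

Computing /Length from the data, rather than typing it by hand, keeps the dictionary consistent whenever the content changes.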

In this example, the byte sequence is represented literally, but it’s possible (and usual) to encode the byte sequence. This is done with a stream filter. A stream filter specifies how the sequence of bytes has to be decoded. Let’s take the same example, but with an ASCII85 encoding:

The /Filter entry instructs the PDF reader how to decode the byte sequence (/ASCII85Decode). Notice the change of the length value. There are many encoding schemes (ASCII filters and decompression filters); here is the list:

  • ASCIIHexDecode
  • ASCII85Decode
  • LZWDecode
  • FlateDecode
  • RunLengthDecode
  • CCITTFaxDecode
  • JBIG2Decode
  • DCTDecode
  • JPXDecode
  • Crypt
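To make the ASCII85 case concrete, here is a small Python sketch. It relies on base64.a85encode from the standard library and appends the ~> end-of-data marker that /ASCII85Decode expects. The content bytes are a hypothetical 42-byte example and not necessarily the ones used in the post.

```python
import base64

# Hypothetical 42-byte content stream (an assumption for illustration).
content = b"BT /F1 24 Tf 100 700 Td (Hello World)Tj ET"

# Encode: ASCII85 turns every 4 bytes into 5 printable characters.
encoded = base64.a85encode(content) + b"~>"

# Decode: drop the end-of-data marker, then reverse the encoding.
decoded = base64.a85decode(encoded[:encoded.index(b"~>")])

print(encoded.decode("ascii"))
print("literal length:", len(content))
print("encoded length:", len(encoded))  # this is the value /Length must hold
```

The encoded form is longer than the literal one, which is why the /Length value changes once the filter is applied.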

This list is not so long, so why do I claim an almost limitless number of ways to encode a stream? I have 2 reasons:

  1. Many filters have settings that influence the encoding too: for /FlateDecode, the encoder chooses a compression level, and predictor parameters can be passed in /DecodeParms
  2. Filters can be cascaded, meaning that the stream has to be decoded by more than one filter
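The first reason can be checked with a short Python sketch using the standard zlib module, which produces the format /FlateDecode consumes. The stream content and the choice of levels are assumptions for illustration.

```python
import zlib

# Hypothetical stream content, repeated so the compressor has work to do.
data = b"BT /F1 24 Tf 100 700 Td (Hello World)Tj ET\n" * 20

# Encode the same bytes at several compression levels.
encodings = {level: zlib.compress(data, level) for level in (0, 1, 6, 9)}

for level, encoded in encodings.items():
    # The first two bytes are the zlib header that precedes the deflate data.
    print("level", level, "length", len(encoded), "header", encoded[:2].hex())

# Every variant decodes to exactly the same byte sequence.
print(all(zlib.decompress(e) == data for e in encodings.values()))
```

A reader does not need to know which level was used: the decoder accepts all of these representations, so each one is a valid encoding of the same stream.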

Here is our example, where the stream is encoded twice, first with ASCII85 and then with plain HEX (I know, this is rather pointless, but it yields simple and readable examples):
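A cascade like this can be sketched in Python as follows. The helper names and the content bytes are hypothetical, and the hex decoder is simplified (it assumes no embedded whitespace and an even number of digits).

```python
import base64
import binascii

def ascii85_encode(data: bytes) -> bytes:
    return base64.a85encode(data) + b"~>"          # ~> marks end of data

def ascii85_decode(data: bytes) -> bytes:
    return base64.a85decode(data[:data.index(b"~>")])

def asciihex_encode(data: bytes) -> bytes:
    return binascii.hexlify(data).upper() + b">"   # > marks end of data

def asciihex_decode(data: bytes) -> bytes:
    # Simplification: no whitespace handling, even digit count assumed.
    return binascii.unhexlify(data[:data.index(b">")])

DECODERS = {"ASCIIHexDecode": asciihex_decode, "ASCII85Decode": ascii85_decode}

def decode_stream(data: bytes, filters: list) -> bytes:
    """Apply each named filter in turn, as a reader would."""
    for name in filters:
        data = DECODERS[name](data)
    return data

content = b"BT /F1 24 Tf 100 700 Td (Hello World)Tj ET"  # hypothetical
encoded = asciihex_encode(ascii85_encode(content))        # ASCII85 first, then hex

# Decoding undoes the last encoding first.
filters = ["ASCIIHexDecode", "ASCII85Decode"]
print(len(content), len(encoded))
print(decode_stream(encoded, filters) == content)
```

In the stream dictionary the filters are listed in decoding order, so this cascade would be written as /Filter [/ASCIIHexDecode /ASCII85Decode].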

Cascading filters also inspired me to create a couple of test PDF documents. For example, I’ve created a small PDF document of just 2642 bytes that contains a 1 GB stream (a ZIP bomb of sorts). Some PDF readers will choke on this document.
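The effect can be illustrated at a much smaller scale with a Python sketch. It assumes the stream is Flate-compressed twice; the post does not say which filters the test document actually uses, and the function names and the 1 MiB limit are hypothetical. The second helper shows one way a reader could protect itself by capping the decoded size.

```python
import zlib

def flate_twice(data: bytes) -> bytes:
    """Apply Flate compression twice (a /FlateDecode cascade of depth two)."""
    return zlib.compress(zlib.compress(data, 9), 9)

def bounded_inflate(data: bytes, limit: int) -> bytes:
    """Inflate one Flate layer, refusing to produce more than `limit` bytes."""
    decoder = zlib.decompressobj()
    # Ask for one byte more than allowed so an oversized stream is detectable.
    out = decoder.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError("decoded stream exceeds %d bytes" % limit)
    return out

raw = bytes(8 * 1024 * 1024)   # 8 MiB of zero bytes (far smaller than 1 GB)
once = zlib.compress(raw, 9)
twice = flate_twice(raw)
print(len(raw), len(once), len(twice))

# The outer layer decodes to something small; the inner layer is where
# the size explodes, and where the cap triggers.
inner = bounded_inflate(twice, 1024 * 1024)
try:
    bounded_inflate(inner, 1024 * 1024)
except ValueError as exc:
    print("refused:", exc)
```

Because the output of the first compression pass is itself highly repetitive, the second pass shrinks it again, which is how a tiny file can expand to an enormous stream.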

17 Comments »

  1. […] security professional Didier Stevens has highlighted a potential exploit in PDF Stream Objects which could be used to cause a PDF file to balloon in size, prompting Computerworld to label it […]

    Pingback by PDF Bomb - PDFalerts — Tuesday 27 May 2008 @ 20:09

  2. Some of these filters cannot be used to hide scripts with exploits, because they do lossy compression and are suitable only for images. I think (but am not 100% sure) that CCITTFaxDecode, JBIG2Decode, DCTDecode and JPXDecode all fall in this category. They might be usable for a denial-of-service attack (the equivalent of the ZIP bomb), although I have my doubts about that too.

    Comment by Vesselin Bontchev — Wednesday 28 May 2008 @ 18:03

  3. It’s true that these filters are lossy, but the first 3 of them take parameters, and I believe it’s possible to parameterize a lossless compression. But I have not tested this.

    Comment by Didier Stevens — Wednesday 28 May 2008 @ 20:40

  4. […] security professional Didier Stevens has highlighted a potential exploit in PDF Stream Objects which could be used to cause a PDF file to balloon in size, prompting Computerworld to label it the […]

    Pingback by PDF Bomb — Thursday 14 August 2008 @ 8:28

  5. […] at the PDF code of the /JBIG2Decode vulnerability. It doesn’t have to be an XObject, just a stream object with a /JBIG2Decode […]

    Pingback by Quickpost: /JBIG2Decode Essentials « Didier Stevens — Monday 2 March 2009 @ 23:11

  6. […] Filed under: My Software, PDF — Didier Stevens @ 0:00 @binjo ’s tweet made me realize PDF filter abbreviations do apply to stream objects too, although the PDF reference document only defines them […]

    Pingback by PDF Filter Abbreviations « Didier Stevens — Monday 11 May 2009 @ 0:01

  7. What is the way to understand the encoded text? Can we write the text in its normal form? I want to understand the FlateDecode filter in depth: how do I convert the characters to their normal form, and how are the offsets managed? I want to understand the internal structure of a PDF document in depth. Any sites?

    Comment by khushi — Thursday 5 November 2009 @ 8:05

  8. Take a look at the PDF Reference document http://www.adobe.com/devnet/pdf/pdf_reference.html
    And for flatedecode, research zlib compression.

    Comment by Didier Stevens — Thursday 5 November 2009 @ 16:53

  9. […] security professional Didier Stevens has highlighted a potential exploit in PDF Stream Objects which could be used to cause a PDF file to balloon in size, prompting Computerworld to label it the […]

    Pingback by PDF Bomb | 4x PDF Blog — Monday 1 March 2010 @ 23:44

  10. Hi, Thanks for your introduction about the pdf text object.
    I am trying to decode a PDF file but have a small question about the decoding method. Could you give me some instructions for decoding these words?

    The result I got from pdf file is:

    BT
    /F0 27 Tf.
    1 0 0 1 60 585.602 Tm.
    (003)Tj.
    1 0 0 1 78.171 585.602 Tm.
    (00W00C00T)Tj.
    1 0 0 1 114.891 585.602 Tm.
    (00V00\\)Tj.
    1 0 0 1 141.486 585.602 Tm.
    (002400&)Tj.

    —————————————
    The decode method is FlateDecode.
    According to the document, these words should be “Quartz 2D”.
    But I don’t know how to translate something like (00W00C00T) into the final result.
    Could you give me some instructions on how to translate this code?

    Thank you~~

    Comment by Eric — Monday 12 April 2010 @ 18:43

  11. Sorry, retyping the content again.
    BT
    /F0 27 Tf.
    1 0 0 1 60 585.602 Tm.
    (\0003)Tj.
    1 0 0 1 78.171 585.602 Tm.
    (\000W\000C\000T)Tj.
    1 0 0 1 114.891 585.602 Tm.
    (\000V\000\\)Tj.
    1 0 0 1 141.486 585.602 Tm.
    (\000\024\000&)Tj.

    Comment by Eric — Monday 12 April 2010 @ 18:47

  12. Hi

    How do I count the length of a stream, or how do I rip the bytes of a stream, please?

    Comment by shivan — Monday 7 June 2010 @ 5:23

  13. @shivan The stream is the sequence of bytes between the stream and endstream keywords.

    Comment by Didier Stevens — Monday 7 June 2010 @ 7:59

  14. […] Stevens defines a PDF stream object as a sequence of bytes (link here). Stream objects can contain data (for example an image) or […]

    Pingback by Malware Diaries » Blog Archive » Malicious PDF stream objects will be the norm — Wednesday 30 June 2010 @ 17:28

  15. Hi

    I’m new to PDF and I have a problem with FlateDecode in PHP. I take the bytes of the stream between “stream” and “endstream”, skipping the initial and final carriage return and line feed. Then I use the gzinflate function, but it doesn’t work every time.

    It works only in one case: I noticed that every stream begins with two characters, and if I remove them from the stream, it works. I think that’s not normal!

    Thanks for your help

    Comment by Rubin — Saturday 21 August 2010 @ 14:06

  16. Comment by Eric — Monday 12 April 2010 @ 18:47

    Hello Eric. It is a stack-based language, so the parameters come in front of the operator.

    Tf specifies the font face and the font size. /F12 1 Tf is font 12 and size 1.

    Tj outputs a string. (03)Tj outputs the ASCII code 03.

    TJ outputs multiple strings, which can be positioned. [(abc)110(xyz)1000(bbb)]TJ -> abc xyz bbb

    Tm sets the text matrix.

    There are a dozen operators, which sometimes do the same as a combination of other operators.

    As you see it is not so easy to extract information from a pdf. You need the pdf reference.

    I don’t think this will be a problem in most pdf-readers. Most of them have maximum sizes set on decoded streams.

    Comment by Pino — Monday 15 November 2010 @ 14:50

  17. […] an image before writing them to a file using the commands from open().write(). A stream object (http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/) is the way pdf embeds objects, represented below. The find command can be used to […]

    Pingback by Morning Joe/Python PDF Part 3: Straight Optical Character Recognition | Wired Andy Blog — Friday 29 August 2014 @ 14:51

