Didier Stevens

Monday 7 October 2013

Finding Contained Files

Filed under: Forensics,My Software — Didier Stevens @ 0:00

Some time ago I had to figure out if a file was embedded inside another file.

It’s not a file carving problem. I had both files. I just needed to be sure that file A was contained inside file B.

With a hex editor I could find parts of file A inside file B, but it looked like file A was split up and scattered at different locations in file B.

I Googled a bit for a tool, but nothing came up, so I wrote my own Python program.

With my new tool I was able to confirm that file msi49.tmp was inside file c8400.msi:


You can see that file msi49.tmp is one contiguous sequence inside file c8400.msi starting at position 0x3A7200.
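For the contiguous case, a plain substring search is already enough. The following is a minimal sketch, not the code of my tool; the function name `find_contiguous` and the sample bytes are made up for illustration:

```python
def find_contiguous(contained: bytes, containing: bytes) -> int:
    """Return the offset of `contained` inside `containing`, or -1 if absent."""
    return containing.find(contained)


# Tiny in-memory stand-ins for the two files (hypothetical data).
containing = b"\x00" * 16 + b"PAYLOAD-BYTES" + b"\xff" * 8
offset = find_contiguous(b"PAYLOAD-BYTES", containing)
print(hex(offset))  # 0x10
```

With real files you would read both with `open(path, "rb").read()` first, which is why both files have to fit in memory.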

But I was more interested in knowing whether file msi49.tmp was also inside file Cisco_Jabber.msi:


And you can see it is, but not as one contiguous sequence. It’s split into 3 sequences.

This tool can also be used to find a downloaded file inside a pcap/pcapng file. I downloaded AnalyzePESig_V0_0_0_2.zip while taking a Wireshark capture.


Or to find a file opened by an application. Here I look into the process dump:


The only limitation is that both files need to be read into memory. When I have time, I’ll turn this into a plugin for the Volatility framework.

The program looks for sequences at least 10 bytes long (the minimum length is an option). If your file is divided into sequences smaller than 10 bytes, my program will not find the embedded file unless you lower the minimum length. But don’t go as low as 1 byte, because then you’re likely to match random data.
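To illustrate the idea of matching sequences with a minimum length, here is a simplified greedy sketch. It is an assumption about how such a search could work, not necessarily the algorithm find-file-in-file.py uses; the name `find_sequences` and the `min_length` parameter are hypothetical:

```python
def find_sequences(contained: bytes, containing: bytes, min_length: int = 10):
    """Greedily split `contained` into runs that each occur in `containing`.

    Returns a list of (offset_in_contained, offset_in_containing, length)
    tuples, or None if some part cannot be matched by a run of at least
    `min_length` bytes.
    """
    sequences = []
    pos = 0
    while pos < len(contained):
        if len(contained) - pos < min_length:
            return None  # leftover tail is too short to count as a match
        # Start with the shortest run we accept...
        length = min_length
        where = containing.find(contained[pos:pos + length])
        if where == -1:
            return None
        # ...then extend it byte by byte while it still occurs somewhere.
        while pos + length < len(contained):
            longer = containing.find(contained[pos:pos + length + 1])
            if longer == -1:
                break
            where = longer
            length += 1
        sequences.append((pos, where, length))
        pos += length
    return sequences


# Hypothetical data: the two halves are stored in reverse order.
part1 = b"0123456789abcdef"
part2 = b"GHIJKLMNOPQRSTUV"
containing = b"--" + part2 + b"....." + part1 + b"--"
print(find_sequences(part1 + part2, containing))
# [(0, 23, 16), (16, 2, 16)]
```

Each run is searched for independently, so the order of the runs in the containing file does not matter here. The sketch is slow (it rescans the containing file for every extra byte), and because it is greedy it can give up when the last leftover piece is shorter than `min_length`, even if a different split would have worked.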

I’m not 100% sure that my program will find all possible cases of embedded files. There’s no problem if it’s one contiguous sequence, or several sequences in logical order. But I still have to review my algorithm to be sure it will also find all cases where the sequences are in random order. I think it will, but I need to prove it.

find-file-in-file_v0_0_1.zip (https)
MD5: 2984F01404770B92953823D39907B055
SHA256: 1AD124A9A31DACFE1FC9F3B89B3117D3A70D5BC15B712CC1748BEA893612686C


  1. Nice! Now I wonder what msi49.tmp really is. Especially since I use Jabber.

    Comment by Richard — Monday 7 October 2013 @ 12:20

  2. Hi Didier,

    FYI, MSIs use the Microsoft Compound Document file format, documented here: http://www.forensicswiki.org/w/images/5/5b/Compdocfileformat.pdf. I found this out whilst working out how to stick a metasploit payload into MSIs: http://rewtdance.blogspot.co.uk/2013/03/metasploit-msi-payload-generation.html

    Nice work on the tool though, looks useful!

    Comment by Meatballs — Monday 7 October 2013 @ 12:31

  3. @Richard. msi49.tmp is a UPX-packed DLL file that is used each time a new profile is created. So each time a user logs on for the first time on a Windows machine, a profile needs to be created, and this DLL will execute to configure Jabber.

    This is something you see with other software too.

    This DLL is not part of the Jabber application, that’s why I couldn’t find it in the deployment repository (cab file inside the msi file).

    This DLL is used for the configuration, and as such is stored as a binary stream inside the msi file.

    Comment by Didier Stevens — Monday 7 October 2013 @ 13:44

  4. @Meatballs

    Thanks for the info. I’m familiar with msi files, I created them back in my life as a dev. They are essentially a file format using tables.

    Files embedded in msi files are usually put inside a cab file.
    But this was not the case here with the tmp file. The reason, I found out, is that this tmp file is not part of the installation, but is needed for the configuration. See previous comment.

    Comment by Didier Stevens — Monday 7 October 2013 @ 14:23

  5. Really a great tool. It would be useful to make it recursive, I mean add an option to search for a file in a list of files (actually I’m using a batch file, something like “for each file found in directories run find-file-in-file.py”).

    Comment by shinnai — Wednesday 16 October 2013 @ 8:07

  6. Hi Didier,

    Just asking as a beginner: if I understand it correctly, some malicious files, such as PDFs, have embedded malicious files, and normally we can view these in a hexdump. However, I believe your goal is to check further, especially if the embedded file is mixed up. With your tool, if I only know a file (test.pdf) and I don’t know the embedded file, can I just run the command “find-file-in-file.py test.pdf” and will the embedded file name be revealed by that command? I believe you also have a tool to embed a file (I just forgot the name).

    Comment by yagii — Thursday 17 October 2013 @ 3:41

  7. @shinnai Good idea, will be in a next version.

    Comment by Didier Stevens — Thursday 17 October 2013 @ 17:46

  8. @yagii No, this is not for PDF files with embedded files. My PDF tools are better for this.

    Comment by Didier Stevens — Thursday 17 October 2013 @ 17:48

  9. […] made an interesting comment when I released my tool to find contained files: he wanted to know if I could add a batch […]

    Pingback by Update: find-file-in-file.py Version 0.0.3 | Didier Stevens — Friday 15 November 2013 @ 12:55
