In this video, I show how to analyze a .doc malicious document using CyberChef only. This is possible, because the payload is a very long string that can be extracted without having to parse the structure of the .doc file with a tool like oledump.py.
The FlashPix picture format is an old format, based on the Compound File Binary Format (what I like to call OLE files). It has no support for VBA code at all (it doesn’t support any embedded scripting).
However, since it is an ole file, it’s technically possible to add storages and streams containing VBA code. This code can never execute, because the FlashPix specifications does not support it, and hence there are no image viewers that would recognize and execute this code.
And then I took a malicious AutoCAD drawing, and copied the VBA streams and storages into the FlashPix file:
Giving me this file 5040ef90824371a0bd0acaa36263553b.When I submitted this file to VirusTotal a couple of months ago, the AV detection ratio was 29/59. Which is far better than the other “AV-alert pictures” that I created.
I helped a friend creating picture files to be detected by anti-virus. They are not malicious: they don’t execute code neither trigger a vulnerability.
The EICAR test file is detected by many anti-virus programs, except when it is appended to arbitrary files (this is according to specs).
Starting with a one-pixel JPEG and PNG file, I append the EICAR test file. And with a JPEG file, I can also insert the EICAR file as a comment:
The detection scores on VirusTotal show that these files are not detected by many anti-virus programs:
That special ZIP file is a concatenation of 2 ZIP files, the first containing a single PNG file (with extension .jpg) and the second a single EXE file (malware). Various archive managers and security products handle this file differently, some “seeing” only the PNG file, others only the EXE file.
My zipdump.py tool reports the following for this special ZIP file:
zipdump.py is essentially a wrapper for Python’s zipfile module, and this module parses ZIP files “starting from the end of the file”. That’s why it finds the second ZIP file (appended to the first ZIP file), containing the malicious EXE file.
To help with the analysis of such special/malformed ZIP files, I added an option (-f –find) to zipdump. This option scans the content of the provided file looking for ZIP records. ZIP records start with ASCII string PK followed by 2 bytes to indicate the record type (byte values less than 16).
Here I use option “-f list” to list all PK records found in a ZIP file containing a single text file:
This is how a normal ZIP file containing a single file looks on the inside.
The file starts with a “local file header”, a PK record that starts with ASCII characters PK followed by bytes 0x03 and 0x04 (that’s 50 4B 03 04 in hexadecimal). In zipdump’s report, such a PK record is identified with PK0304. This header is followed by the contained file (usually compressed).
Then there is a “central directory header”, a PK record that starts with ASCII characters PK followed by bytes 0x01 and 0x02 (that’s 50 4B 01 02 in hexadecimal). In zipdump’s report, such a PK record is identified with PK0102. This header contains an offset pointing to the corresponding PK0304 record.
And at the end of the ZIP file, there is a “end of central directory”, a PK record that starts with ASCII characters PK followed by bytes 0x05 and 0x06 (that’s 50 4B 05 06 in hexadecimal). In zipdump’s report, such a PK record is identified with PK0506. This header contains an offset pointing to the first PK0102 record.
A ZIP file containing 2 files looks like this, when scanned with zipdump’s option -f list:
Starting with 2 PK0304 records (one for each contained file), followed by 2 PK0102 records, and 1 PK0506 record.
Armed with this knowledge, we take a look at our malicious ZIP file:
We see 2 PK0506 records, and this is unusual.
We see the following sequence of records twice: PK0304, PK0102, PK0506.
From our previous examples, we can now understand that this sample contains 2 ZIP files.
Remark that zipdump assigned an index to both PK0506 records: 1 and 2. This index can be used to select one of the 2 ZIP files for further analysis. Like in this example, where I select the first ZIP file:
Using option “-f 1” (in stead of “-f list”) selects the first ZIP file in the provide sample, and lists its content.
It can then be further analyzed with zipdump like usual, for example, selecting the first file (order.jpg) inside the first ZIP file for an hex/ascii dump:
Likewise, “-f 2” will select the second ZIP file found inside the sample:
-f is a new option that I added for special/malformed ZIP files, but this is a work in progress, as there are many ways to malform ZIP files.
For example, I created a PoC malformed ZIP file that contains a single file, with reversed PK record order. Here is the output for the normal and “reversed” zip files (malformed, e.g. PK records order reversed):
This file can be opened with Windows Explorer, but there are tools and libraries than can not handle it. Like Python’s zipfile module:
I will further develop zipdump to handle malformed ZIP files as best as possible.
AutoCAD’s drawing files (.dwg) can contain VBA macros. The .dwg format is a proprietary file format. There is some documentation, for example here.
When VBA macros are stored inside a .dwg file, an OLE file is embedded inside the .dwg file. There’s a quick-and-dirty way to find this embedded file inside the .dwg file: search for magic sequence D0CF11E0.
My tool cut-bytes.py can be used to search for the first occurrence of byte sequence D0CF11E0 and extract all bytes starting from this sequence until the end of the .dwg file. This can be done with cut-expression [D0CF11E0]: and pipe the result into oledump.py, like this:
Next, oledump can be used to conduct the analysis as usual, for example by extracting the VBA macro source code:
There is also a more structured approach to locate the embedded OLE file inside a .dwg file. When one looks at a .dwg file with a hexadecimal editor, the following can be seen:
First there is a magic sequence identifying this as a .dwg file: AC1032. This sequence varies with the file format version, but since many, many years, it starts with AC10. You can find more details regarding this magic sequence here and here.
At position 0x24 (36 decimal), there is a 32-bit little-endian integer. This is a pointer to the embedded OLE file (this pointer is NULL when no OLE file with VBA macros is embedded).
In our example, this pointer value is 0x00008080. And here is what can be found at this position inside the .dwg file:
First there is a 16-byte long header. At position 8 inside this header, there is a 32-bit little-endian integer that represents the length of the embedded file. 0x00001C00 in our example. And after the header one can find the embedded OLE file (notice magic sequence D0CF11E0).
This information can then be used to extract the OLE file from the .dwg like, like this:
Achieving exactly he same result as the quick-and-dirty method. The reason we don’t have to figure out the length of embedded OLE the file using the quick-and-dirty method, is that oledump ignores all bytes appended to an OLE file.
I will adapt my oledump.py tool to extract macros directly from .dwg files, without the need of a tool like cut-bytes.py, but I will probably implement something like the quick-and-dirty method, as this method would potentially work for other file formats with embedded OLE files, not only .dwg files.
For example, here is a WAV file with a hidden, embedded PE file. The PE file is encoded in the least significant bit of 16-bit integers that encode PCM sound.
I was wondering how I could extract this embedded file with my tools. There was no easy solution, because many of my tools operate on byte streams, but here I have to operate on a bit stream. So I made an update to my format-bytes.py tool.
Using my tool file-magic.py, I get confirmation that this is a sound file (.WAV) with 16-bit PCM data.
And here is an ASCII/HEX dump of the beginning of the file made with cut-bytes.py:
The data chunk starts with magic sequence ‘data’ (in yellow), followed by the size of the data chunk (in green), and then the data itself: 16-bit, little-endian signed integers (in red).
To extract the least significant bit of each 16-bit, little-endian signed integer and assemble them into bytes, I use the latest version of format-bytes.py.
This is the command that I use:
format-bytes.py -a -f “bitstream=f:<H,b:0,j:<” #c#[‘data’]+8: DB043392816146BBE6E9F3FE669459FEA52A82A77A033C86FD5BC2F4569839C9.wav.vir
With option -f, I specify a bitstream format.
f:<H means that the format of the data is little-endian (<), unsigned 16-bit integers (H). I could also specify a signed 16-bit integer (h), but this doesn’t matter here, as I’m not going to use the sign of the integers.
b:0 means that I extract the least-significant bit (position 0) of each 16-bit integer.
j:< means that I assemble (join) these bits into bytes from least significant to most significant (<).
The data starts 8 bytes into the data chunk, e.g. 8 bytes after magic sequence ‘data’. I define this with cut-expression #c#[‘data’]+8:.
When I run this command, and perform an ASCII dump, I get this output for the beginning of the stream:
I can indeed see an executable (MZ), but it is preceded by 4 bytes. These 4 bytes are the length of the embedded file. As described in the article, the length is big-endian encoded. Hence I use a similar command to extract the length, but with j:>, as can be seen here:
The length is 733696 bytes, and this matches the IOCs from the article.
Then I use my tool pecheck.py to search for PE files inside the byte stream (-l P), like this:
MD5 7cb0e1e2cf4a9bf450a350a759490057 is indeed the hash of the malicious DLL encoded in this WAV file.
The binary file format of Office documents (.doc, .xls) uses the Compound File Binary Format, what I like to refer as OLE files. These files can be analyzed with my tool oledump.py.
Starting with Office 2007, the default file format (.docx, .docm, .xlsx, …) is Office Open XML: OOXML. It’s in essence a ZIP container with XML files inside. However, VBA macros inside OOXML files (.docm, .xlsm) are not stored as XML files, they are still stored inside an OLE file: the ZIP container contains a file with name vbaProject.bin. That is an OLE file containing the VBA macros.
oledump.py can look inside the ZIP container to analyze the embedded vbaProject.bin file:
And of course, it can handle an OLE file directly:
When ExifTool is given a vbaProject.bin file for analysis, it will misidentify it as a picture file: a FlashPix file.
That’s because when ExifTool doesn’t have enough metadata or an identifying extension to identify an OLE file, it will fall back to FlashPix file detection. That’s because FlashPix files are also based on the OLE file format, and AFAIK ExifTool started out as an image tool:
That is why on VirusTotal, vbaProject.bin files from OOXML files with macros, will be misidentified as FlashPix files:
When the extension of a vbaProject.bin file is changed to .doc, ExifTool will misidentify it as a Word document:
ExifTool is not designed to identify VBA macro files (vbaProject.bin). These files are not Office documents, neither pictures. But since they are also OLE files, ExifTool tries to guess what they are, based on the extension, and if that doesn’t help, it falls back to the FlashPix file format (based on OLE).
There’s no “bug” to fix, you just need to be aware of this particular behavior of ExifTool: it is a tool to extract information from media formats, when it analyses an OLE file and doesn’t have enough metadata/proper file extension, it will fall back to FlashPix identification.
There are a couple of bug fixes for pdf-parser and pdfid.
And 2 new features in pdf-parser, inspired by a private training on maldoc analysis I gave last week. I often get good ideas from my students, and sometimes, even I get a good idea in class 🙂 .
Option -o can now be used to select multiple objects: separate the indices by a comma.
There’s a new environment variable, PDFPARSER_OPTIONS, that can be used to provide extra options you want to include with each execution of pdf-parser.py. This is useful for option -O, an option to parse stream objects.
It’s actually best to always parse stream objects, i.e. always use option -O. But I decided not to make this an option that is on by default, so that the behavior of pdf-parser would remain unchanged. I consider this important for the many people that rely on a predictable behavior of pdf-parser, like teachers and students of infosec trainings where my tools are used/mentioned.
However, always including option -O is tedious and error prone. So now you can have best of both worlds, by defining an environment variable with name PDFPARSER_OPTIONS and value -O.
And finally, I started to add a man page (option -m), like I do with many of my other tools. This is a work in progress: for the moment, it points to my free PDF analysis e-book that explains the use of pdfid and pdf-parser.
So, I’m trying to sleep but I’m hearing people talk on the other side of the hotelroom wall. But it is an outer wal… twitter.com/i/web/status/1…2 days ago