Didier Stevens

Monday 31 July 2006


Filed under: Reverse Engineering — Didier Stevens @ 17:24

A friend faced the following problem: his company has to provide confidential data to a financial company. To maintain the confidentiality of the data, this financial company provided my friend with a custom-made program to “protect” the data to be provided.

But my friend doesn’t trust unknown programs; he wanted to know exactly what protection this program offered. The financial company didn’t want to provide further details about their program, so my friend called me for help.

To remain confidential, data transferred on public channels must be protected with strong encryption, the implementation of the cryptographic process must be free of errors and the cryptographic keys must be managed securely.

First we get acquainted with the program. In fact, it’s very simple: you start the program, open the file to be protected and save the resulting file with another extension.
So there’s no password to be provided. This is an indication that the cryptographic key is stored in the program. This is no problem, as long as public key cryptography is used. However, if secret-key cryptography is used, the secret key can be retrieved from the program by reverse engineering and can then be used to decrypt the data.

The protected file is much smaller than the original (around 4 times), so compression is involved. A first glance at the protected file with a hex editor (like XVI32) doesn’t reveal much: there’s nothing readable.

One can follow 2 paths to identify if cryptographic methods are used in a program: you can analyze the program and you can analyze the data.

When analyzing the program, the goal is to identify cryptographic algorithms. The cryptographic library can be linked statically or dynamically. For Windows programs, you can use a dependency viewer (like Dependency Walker) to view the imported DLLs. For statically linked programs, you can use FindCrypt2 by Ilfak. It’s an IDA Pro plugin that looks for cryptographic constants in the disassembled code.
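
To give an idea of what "looking for cryptographic constants" means, here is a rough Python sketch. It is only an illustration of the idea, not how FindCrypt2 is implemented: the signature list is my own tiny selection (the start of the AES S-box and of the SHA-256 round constant table), and the function name is made up for this example. A real tool ships a far larger set of signatures.

```python
import struct

# A few well-known constants, chosen for illustration only.
SHA256_K_START = (0x428A2F98, 0x71374491, 0xB5C0FBCF, 0xE9B5DBA5)
SIGNATURES = {
    "AES S-box": bytes.fromhex("637c777bf26b6fc53001672bfed7ab76"),
    "SHA-256 K table (little-endian)": struct.pack("<4I", *SHA256_K_START),
    "SHA-256 K table (big-endian)": struct.pack(">4I", *SHA256_K_START),
}


def find_crypto_constants(data: bytes) -> list[tuple[str, int]]:
    """Return (signature name, 0-based offset) for every match in data."""
    hits = []
    for name, signature in SIGNATURES.items():
        pos = data.find(signature)
        while pos != -1:
            hits.append((name, pos))
            pos = data.find(signature, pos + 1)
    return sorted(hits, key=lambda hit: hit[1])
```

No hit does not prove the absence of cryptography: constants can be computed at run time or split across instructions.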

We decide to proceed with the analysis of the protected data. Reverse engineering will come later, but I can’t resist a quick peek at the strings in the program (with BinText). These strings stand out to me:

deflate 1.1.4 Copyright 1995-2002 Jean-loup Gailly 
inflate 1.1.4 Copyright 1995-2002 Mark Adler

They come from the zlib library, an open source compression library that implements the deflate algorithm used by GZip.
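
If you don't have a strings tool at hand, the same quick peek can be sketched in a few lines of Python. This is a minimal approximation of what such tools do (ASCII only, no Unicode strings), and the function names and the minimum length of 6 are my own choices for this example.

```python
import re


def printable_strings(data: bytes, min_len: int = 6) -> list[str]:
    """Return every run of at least min_len printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def zlib_banners(data: bytes) -> list[str]:
    """Keep only the strings that look like zlib's deflate/inflate banners."""
    return [s for s in printable_strings(data)
            if "deflate " in s or "inflate " in s]
```

Calling `zlib_banners(open(path, "rb").read())` on the program file should list banner strings like the two shown above.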

We encrypt the same data file a second time, and save it with another name. The size of the 2 protected files is the same. Comparing these 2 files with JojoDiff (a binary comparison program) shows that the files are almost identical:

jdiff-w32 -lr test1.prt test2.prt 
       1        1 EQL 10 
      11       11 MOD 256 
     267      267 EQL 2380

The -lr option displays ASCII output with regions (sequential parts of the binary files).
This result shows us that the first 10 bytes and the last 2380 bytes are the same; only a region of 256 bytes differs. Because most of the file is the same, we can deduce that the protection always uses the same encryption key (the 256 different bytes are probably a structure of status fields like filename, timestamps and other stuff). So this program doesn’t use “fancy” stuff like session keys, salting, initialization vectors, …
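
For two files of the same size, this kind of region listing can be approximated with a short Python sketch. It is not how JojoDiff works internally (it does no insert/delete detection at all); it just groups consecutive byte positions into equal (EQL) and modified (MOD) runs, with 1-based offsets as in the output above. The function name is hypothetical.

```python
from itertools import groupby


def compare_regions(a: bytes, b: bytes) -> list[tuple[int, str, int]]:
    """Return (1-based offset, 'EQL' or 'MOD', length) for each run of bytes."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles files of equal size")
    regions = []
    offset = 1
    # groupby collapses consecutive True (equal) / False (different) flags.
    for same, run in groupby(x == y for x, y in zip(a, b)):
        length = sum(1 for _ in run)
        regions.append((offset, "EQL" if same else "MOD", length))
        offset += length
    return regions
```

On real files, two differing regions can contain a byte that happens to match by chance, so this naive version may split one MOD region into several smaller ones.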

Now we protect several other files and compare them: the size is different and only the first 10 bytes are the same.
We formulate our hypothesis for the file format:
Bytes 1-10: header (magic bytes)
Bytes 11-266: status data
Bytes from 267 on: encrypted data

Now we will concentrate on the encrypted data. Strong ciphertext should be hard to distinguish from a series of random bytes. CrypTool is a freeware program which enables you to apply and analyse cryptographic mechanisms. It’s an excellent educational program. We will use it to see how “random” the encrypted data is.

The Analysis / General menu option in CrypTool has several tools to analyze ciphertext (like calculating the entropy), but because we have no clear idea of the results we can expect with strong encryption, we do the following:

We take our data file and use several methods to generate a “transformed” file:

  1. We protect it with the provided program
  2. We ZIP it
  3. We GZip it
  4. We password protect it with ZIP but don’t use compression, just store.
  5. We encrypt it with RSA
  6. We encrypt it with Ncrypt

We analyze each file with the CrypTool cryptanalysis tools and compare the results. I won’t detail each result here, but we have 2 important results.

First, the entropy of our protected file is in the same range as the entropy of the compressed files, rather than the encrypted files:

File  Method                               Entropy
1     Protected with the provided program  7.92
2     ZIP                                  7.89
3     GZip                                 7.93
4     ZIP, password, store only            7.98
5     RSA                                  7.97
6     Ncrypt                               7.97

The maximum entropy is 8 (bits per byte).
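
For readers without CrypTool, byte entropy is easy to compute yourself. The sketch below uses the standard Shannon formula over byte frequencies; I assume this is comparable to what CrypTool reports, but I have not verified that its figures match to the last decimal.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A file in which all 256 byte values occur equally often reaches the maximum of 8, while a file consisting of a single repeated byte scores 0.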

Second, we find the same periodicity cycle in the protected file and the GZipped file, but at a different offset:

Periodicity analysis of test1.prt: 
No.    Offset    Length    Number of cycles    Cycle content 
1    2637    1    2        .     00
Periodicity analysis of test.gz: 
No.    Offset    Length    Number of cycles    Cycle content 
1    2388    1    2        .     00

The difference in offset is 249 bytes (2637 - 2388), close to the size of the header and status data (10 + 256 = 266)!
This is a strong indication that the protected data is just compressed, not encrypted, and that it’s GZip compressed.

The binary comparison of the protected file and the GZipped file opens our eyes:

jdiff-w32 -lr test1.prt test.gz 
       1        1 MOD 598 
     599      599 DEL 129 
     727      598 EQL 1791

Both files share the same sequence of 1791 bytes!

We review our hypothesis for the file format:
Bytes 1-10: header (magic bytes)
Bytes 11-266: status data
Bytes from 267 on: GZipped data (not encrypted)

I know that jdiff can be confused when comparing files which start differently but then continue identically, so we decide to compare them starting from the end. We binary reverse both files and compare them again:

jdiff-w32 -lr reverse-test1.prt reverse-test.gz 
       1        1 EQL 2370 
    2371     2370 MOD 19

Wow! The GZipped file is almost completely included in the protected file, except for 19 bytes (this is very likely the GZip header which contains, among other things, the original file name).
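
The "compare from the end" trick boils down to measuring how many trailing bytes two files have in common. A minimal Python sketch (the function name is mine) does this without having to write reversed copies of the files:

```python
def common_suffix_length(a: bytes, b: bytes) -> int:
    """Count how many bytes at the end of a and b are identical."""
    count = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        count += 1
    return count
```

Applied to the protected file and the GZipped file, the result should correspond to the EQL region reported for the reversed files.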

To test our hypothesis, we strip the first 266 bytes from the protected file (with the tail command), name it test.gz and decompress it with the gzip command. Success! We have recovered our original file, and we prove that the so-called “protection” provided by the program is not encryption, just standard compression! It can easily be defeated in a few seconds with 2 simple commands: tail and gzip.
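
On the command line, the stripping step is something like `tail -c +267 test1.prt > test.gz`, followed by gzip to decompress. The same two steps can be sketched in Python. This assumes the layout deduced above (10 header bytes followed by 256 more bytes, then a GZip stream); the function and constant names are made up for this example.

```python
import gzip

HEADER_LEN = 10    # magic bytes at the start of the protected file
SKIPPED_LEN = 256  # the 256 bytes that differ between two runs


def unprotect(protected: bytes) -> bytes:
    """Drop the first 266 bytes and gunzip the rest."""
    return gzip.decompress(protected[HEADER_LEN + SKIPPED_LEN:])
```

Reading the protected file with `open(path, "rb").read()` and passing the bytes to `unprotect` should return the original data, if the file really follows this layout.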

This analysis has taken us about 2 hours. My friend has his answer about the protection level provided by the program. Now it’s up to him to report this to his manager and decide how to proceed.

Later on, I started reverse engineering the program.
The first 10 bytes are a fixed string, the so-called magic bytes, used to identify the file type.
The next 256 bytes are just random bytes generated by the program, and have no meaning whatsoever! The program seeds the RNG with the current time, which explains why protecting the same file twice gives a different 256-byte sequence.

By now I knew enough to formulate a final, proven hypothesis about the file format:
Bytes 1-10: header (magic bytes)
Bytes 11-266: garbage (random bytes, not status data)
Bytes from 267 on: GZipped data (not encrypted).

Yet Another Case of Security Through Obscurity. Or, quoting Bruce Schneier, “Snake Oil”!
