I recently created 2 blog posts with corresponding videos for the reversing of encodings.
The first one is on the ISC diary: “Decoding Obfuscated BASE64 Statistically”. The payload is encoded with a variation of BASE64, and I show how to analyze the encoded payload to figure out how to decode it.
And this is the video for this diary entry:
And on this blog, I have another example, more complex, where the encoding is a variation of hexadecimal encoding, with some obfuscation: “Another Exercise In Encoding Reversing”.
In this blog post, I will show how to decode a payload encoded in a variation of hexadecimal encoding, by performing statistical analysis and guessing some of the “plaintext”.
I do have the decoder too now (a .NET assembly), but here I’m going to show how you can try to decode a payload like this without having the decoder.
Seeing all these letters, I thought: this is lowercase Netbios Name encoding. That is an encoding where each byte is represented by 2 hexadecimal characters, but the characters are all letters, instead of digits and letters. Since my tool base64dump.py can handle netbios name encoding, I let it try all encodings:
That failed: no netbios encoding was found. Only base64 and 2 variants of base85, but that doesn’t decode to anything I recognize. Plus, for the last 2 decodings, only 17 unique characters were found. That makes it very unlikely that it is indeed base64 or base85.
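For reference, NetBIOS name encoding can be sketched in a few lines of Python. Each byte is split into two 4-bit halves, and each half is added to the letter A (or a for the lowercase variant), so only 16 consecutive letters can appear. The function name is illustrative and this is not code taken from base64dump.py.

```python
def decode_netbios_lowercase(encoded: str) -> bytes:
    """Decode lowercase NetBIOS name encoding: two letters (a..p) per byte."""
    if len(encoded) % 2 != 0:
        raise ValueError("encoded length must be even")
    decoded = bytearray()
    for i in range(0, len(encoded), 2):
        high = ord(encoded[i]) - ord("a")       # first letter: upper 4 bits
        low = ord(encoded[i + 1]) - ord("a")    # second letter: lower 4 bits
        if not (0 <= high <= 15 and 0 <= low <= 15):
            raise ValueError("not a NetBIOS letter pair: " + encoded[i:i + 2])
        decoded.append((high << 4) | low)
    return bytes(decoded)


print(decode_netbios_lowercase("enfk"))  # b'MZ'
```

Letters beyond p, like the y that starts this payload, fall outside that alphabet, which is consistent with the failed NetBIOS decoding.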
Next I use my tool byte-stats.py to produce statistics for the bytes found inside the payload:
There are 17 unique bytes used to encode this payload. The ranges are:
abcdef
i
opqrstuvw
y
This is likely some variant of hexadecimal encoding (16 characters) with an extra character (17 in total).
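The grouping of unique bytes into ranges can be reproduced with a short Python sketch. It is a simplified illustration, not the implementation of byte-stats.py, and the sample bytes are just the 17 letters listed above in arbitrary order.

```python
def byte_ranges(data: bytes) -> list:
    """Group the unique byte values found in data into contiguous ranges."""
    ranges = []
    start = previous = None
    for value in sorted(set(data)):
        if start is None:
            start = previous = value
        elif value == previous + 1:
            previous = value                      # still inside the current range
        else:
            ranges.append(bytes(range(start, previous + 1)).decode("latin-1"))
            start = previous = value              # a gap: open a new range
    if start is not None:
        ranges.append(bytes(range(start, previous + 1)).decode("latin-1"))
    return ranges


sample = b"ydqvabcfeiopwrstu"
print(len(set(sample)), byte_ranges(sample))
```

Counting the unique values first, and only then looking at how they cluster, is what points toward a 16-symbol alphabet plus one extra symbol.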
You will find this default processing code in the template:
I am replacing this default code with the following code (I will post a link to the complete program at the end of this blog post):
The content of the file is in variable data. These are bytes.
Since I’m actually dealing with letters only, I’m converting these bytes to characters and store this into variable encodedpayload.
The next piece of code, starting with “data = []” and ending with “data = bytes(data)”, will read two characters from the encodedpayload, and try to convert them from a hexadecimal byte to a byte. If that fails (ValueError), that pair of characters is just ignored.
And then, with the last statement, I do a hexadecimal/ASCII dump of the data that I was able to convert. This gives me the following:
That doesn’t actually make me any wiser.
Looking at the statistics produced by byte-stats.py, I see that there are 2 letters that appear most frequently, around 9% of the time: d and q.
I do know that the payload is a Windows executable (PE file). PE files that are not packed, contain a lot of NULL bytes. Character 0 is by far the most frequent when we do a frequency analysis of the hexadecimal representation of a “classic” PE file. It often has a frequency of 20% or higher.
That is not the case here for letters d and q. So I don’t know which letter represents digit 0.
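The frequency comparison with a “classic” PE file can be sketched as follows. The helper name is illustrative, and the sample bytes are made up to stand in for an unpacked PE file with many NULL bytes.

```python
import binascii
from collections import Counter


def hex_digit_frequencies(data: bytes) -> dict:
    """Return the relative frequency of each hexadecimal digit of data."""
    hexadecimal = binascii.hexlify(data).decode("ascii")
    counts = Counter(hexadecimal)
    total = len(hexadecimal)
    return {digit: counts[digit] / total for digit in sorted(counts)}


# Made-up sample, not a real PE file.
sample = b"MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff\x00\x00"
for digit, frequency in hex_digit_frequencies(sample).items():
    print(digit, "%.1f%%" % (frequency * 100))
```

In a plain hexadecimal encoding, the letter standing in for digit 0 would be expected to dominate in the same way.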
Let’s make a small modification to the program, and represent each pair of characters that couldn’t be decoded as hexadecimal, by a NULL byte (data.append(0)):
This code produces the following output:
And that is still not helpful.
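The two decoding attempts described so far can be approximated with one function. This is a reconstruction from the description, not the original script; the function name and the fill parameter are illustrative. With fill=None, pairs that are not valid hexadecimal are skipped; with fill=0, they become NULL bytes.

```python
def decode_hex_pairs(encodedpayload: str, fill=None) -> bytes:
    """Convert pairs of characters to bytes, skipping or filling invalid pairs."""
    data = []
    for i in range(0, len(encodedpayload) - 1, 2):
        pair = encodedpayload[i:i + 2]
        try:
            data.append(int(pair, 16))
        except ValueError:
            if fill is not None:
                data.append(fill)     # keep the position, mark it with fill
    return bytes(data)


print(decode_hex_pairs("4dyq5a"))           # invalid pair "yq" is skipped
print(decode_hex_pairs("4dyq5a", fill=0))   # invalid pair "yq" becomes a NULL byte
```

Filling instead of skipping keeps every decoded byte at its original offset, which makes the dump easier to compare with a known file layout.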
Since I know this is a PE file, I know the file has to start with the letters MZ. That’s 4D5A in hexadecimal.
The encoded payload starts with ydua. So let’s assume that this represents MZ (4D5A in hexadecimal), thus y is 4, d is d, u is 5 and a is a.
I will now add a small dictionary (dSubstitute) with this translation, and add code to do a search and replace for each of these letters (that’s the for loop):
This code produces the following output:
Notice that apart from MZ, letters DO also appear. DO is 444F in hexadecimal, and is part of the well-known string found at the beginning of (most) PE files: !This program cannot be run in DOS mode
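A minimal sketch of this substitution step is shown below. It assumes the four translations given above and is not the original code: it translates character by character in a single pass instead of doing a search and replace per letter, so that no character can be substituted twice.

```python
def substitute(encodedpayload: str, dSubstitute: dict) -> str:
    """Translate each letter with a known mapping; leave the others untouched."""
    return "".join(dSubstitute.get(character, character) for character in encodedpayload)


def decode_hex_pairs(text: str) -> bytes:
    """Decode pairs of hexadecimal digits; undecodable pairs become NULL bytes."""
    data = []
    for i in range(0, len(text) - 1, 2):
        try:
            data.append(int(text[i:i + 2], 16))
        except ValueError:
            data.append(0)
    return bytes(data)


# Translation assumed in the text: ydua represents 4d5a (MZ).
dSubstitute = {"y": "4", "d": "d", "u": "5", "a": "a"}
print(decode_hex_pairs(substitute("ydua", dSubstitute)))  # b'MZ'
```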
I will now use this string to try to match more letters with hexadecimal digits (I’m assuming the PE file contains this string).
I add the following lines to print out string “!This program cannot be run in DOS mode” in hexadecimal:
This results in the following output:
Notice that the letter T is represented as 54 in hexadecimal. Hexadecimal digits 5 and 4 are part of the digits we already decoded. 5 is u and 4 is y.
I add code to find the position of the first occurrence of string uy inside the encoded payload:
And this is the output:
Position 86. That’s at the beginning of the payload, so it’s possible that I have found the location of the encoded string “!This program cannot be run in DOS mode”.
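These two steps (printing the known string in hexadecimal, then searching for the encoded form of a known byte) can be sketched like this. The helper names are illustrative, and the short encoded string in the example is made up.

```python
import binascii

DOS_STRING = b"!This program cannot be run in DOS mode"


def to_hex(data: bytes) -> str:
    """Return the lowercase hexadecimal representation of data."""
    return binascii.hexlify(data).decode("ascii")


def encode_with_known_digits(hexadecimal: str, dSubstitute: dict):
    """Translate hex digits back to letters; return None if a digit is unknown."""
    reverse = {digit: letter for letter, digit in dSubstitute.items()}
    if any(digit not in reverse for digit in hexadecimal):
        return None
    return "".join(reverse[digit] for digit in hexadecimal)


dSubstitute = {"y": "4", "d": "d", "u": "5", "a": "a"}
print(to_hex(DOS_STRING))
needle = encode_with_known_digits(to_hex(b"T"), dSubstitute)   # "uy"
print(needle, "ydua..uy..".find(needle))                       # made-up payload
```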
I will now add code that does the following: for each letter of the encoded string, I will look up the corresponding hexadecimal digit in the hexadecimal representation of the unencoded string, and add this decoding pair to the dictionary. If the letter that I add to the dictionary is already present in the dictionary, I compare the stored hexadecimal digit for that letter with the one I looked up, and if they are different, I generate an exception. Because if that happens, I don’t have a one-to-one relationship, and my hypothesis that this is a variant of hexadecimal, is wrong. This is the extra code:
After completing the dictionary, I do a return. I don’t want to do the decoding yet, I just want to make sure that no exception is generated by finding 2 different hexadecimal digits. This is the output:
No exception was thrown: we have a one-to-one relationship.
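The consistency check can be sketched as follows. This is an illustration of the idea, not the original code: it assumes the encoded slice and the plaintext hexadecimal are already aligned and of equal length, and it only checks the direction described above (one letter must never map to two different digits). The letters o and i in the example are made up.

```python
def extend_mapping(encoded: str, plain_hex: str, dSubstitute: dict) -> dict:
    """Pair each encoded letter with the hex digit at the same position."""
    if len(encoded) != len(plain_hex):
        raise ValueError("encoded text and plaintext hexadecimal differ in length")
    result = dict(dSubstitute)
    for letter, digit in zip(encoded, plain_hex):
        if result.get(letter, digit) != digit:
            # Same letter, two different digits: not a simple substitution.
            raise ValueError("letter %r maps to both %r and %r" % (letter, result[letter], digit))
        result[letter] = digit
    return result


# "2154" is the hexadecimal for "!T"; "oiuy" is a made-up encoded form.
print(extend_mapping("oiuy", "2154", {"y": "4", "u": "5"}))
```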
Next I add 2 lines to see how many and what letters I have inside the dictionary:
This is the output:
That is 14 letters (we have 17 in total). That’s a great result.
I remove the return statement now, to let the decoding take place:
Giving this result:
That is a great result. Not only do I see strings MZ and “!This program cannot be run in DOS mode”, but also PE, .text, .data, .rdata, …
I am now adding code to see which letters I’m still missing:
Giving me this output:
The letters I still need to match to hexadecimal digits are: b, c and q.
I want to know where these letters are found inside the partially decoded payload, and for that I add the following code:
Giving me this result:
The letter q appears very soon: as the 6th character.
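Finding the untranslated letters and where they first occur takes only a few lines. The function names are illustrative; the partially decoded string is the one quoted in this post.

```python
def missing_letters(encodedpayload: str, dSubstitute: dict) -> list:
    """Letters used in the payload that have no translation yet."""
    return sorted(set(encodedpayload) - set(dSubstitute))


def first_positions(text: str, letters) -> dict:
    """Zero-based position of the first occurrence of each letter (-1 if absent)."""
    return {letter: text.find(letter) for letter in letters}


print(missing_letters("ydqubca", {"y": "4", "d": "d", "u": "5", "a": "a"}))
print(first_positions("4d5a9q03qq04", "q"))  # index 5, the 6th character
```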
Let’s compare this with the start of another, well-known PE file: notepad.exe:
So notepad.exe starts with 4d5a90000300000004
And the partially decoded payload starts with: 4d5a9q03qq04
Let’s put that right under each other:
4d5a90000300000004
4d5a9q03qq04
If I replace q with 000, I match the beginning of notepad.exe.
4d5a90000300000004
4d5a90000300000004
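The same comparison can be automated by trying runs of zeros of different lengths for q. This is a small sketch with an illustrative function name, using the two strings from the text.

```python
NOTEPAD_START = "4d5a90000300000004"   # start of notepad.exe, from the text
PARTIAL_START = "4d5a9q03qq04"         # start of the partially decoded payload


def matching_expansions(partial: str, reference: str, letter: str, max_zeros: int = 4) -> list:
    """Return the runs of zeros that turn partial into reference."""
    matches = []
    for count in range(1, max_zeros + 1):
        replacement = "0" * count
        if partial.replace(letter, replacement) == reference:
            matches.append(replacement)
    return matches


print(matching_expansions(PARTIAL_START, NOTEPAD_START, "q"))  # ['000']
```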
I add this to the dictionary:
And run the program:
That starts to look like a completely decoded PE file.
But I still have letters b and c.
I’m adding some code to see which hexadecimal characters are left unpaired with a letter:
Output:
Hexadecimal digits b and c have not been paired with a letter.
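A sketch of this check with Python sets follows. Only the translations for y, u, v, q and for a, d, e, f come from this post; the translations for i, o, p, r, s, t and w are placeholders invented for the example.

```python
HEX_DIGITS = set("0123456789abcdef")


def unpaired_hex_digits(dSubstitute: dict) -> list:
    """Hexadecimal digits that no letter translates to."""
    used = set("".join(dSubstitute.values()))
    return sorted(HEX_DIGITS - used)


dSubstitute = {
    # From the text.
    "y": "4", "u": "5", "v": "0", "q": "000",
    "a": "a", "d": "d", "e": "e", "f": "f",
    # Hypothetical placeholders, for illustration only.
    "i": "1", "o": "2", "p": "3", "r": "6", "s": "7", "t": "8", "w": "9",
}
print(unpaired_hex_digits(dSubstitute))  # ['b', 'c']
```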
Now, since a translates to a, d to d, e to e and f to f, I’m going to guess that b translates to b and c to c.
I’m adding code to write the decoded payload to disk:
And after running my script one more time, I’m using my tool pe-check.py to validate that I have indeed a properly decoded PE file:
This looks good.
From the process memory dump I have for this malware, I know that I’m dealing with a Cobalt Strike beacon. Let’s check with my 1768.py tool:
This is indeed a Cobalt Strike beacon.
The encoding that I reversed here is used by GootLoader to encode beacons. It’s a hexadecimal representation, where the decimal digits have been replaced by letters other than abcdef. With an extra twist: while letter v represents digit 0, letter q represents digits 000.
The complete analysis & decoding script can be found here.
cs-decrypt-metadata.py is a new tool, developed to decrypt the metadata of a Cobalt Strike beacon.
An active beacon regularly checks in with its team server, transmitting metadata (like the AES key, the username & machine name, …) that is encrypted with the team server’s public key.
This tool can decrypt this data, provided one of the following applies:
you give it the file containing the private (and public) key, .cobaltstrike.beacon_keys (option -f)
you give it the private key in hexadecimal format (option -p)
the private key is one of the 6 keys in its repository (default behavior)
I will publish blog posts explaining how to use this tool.
A couple of years ago, while experimenting with SYLK files, I created a .slk file that caused Excel to crash.
When you create a text file with content “ID;;”, save it with extension .slk, then open it with Excel, Excel will crash.
Microsoft Security Response Center looked at my DoS PoC last year: the issue will not be fixed. It is a “Safe Crash”, Excel detects the invalid input and calls MsoForceAppExitIf to terminate the Excel process.
If you have Excel crashing with .slk files, then look at the first line. If you see something like “ID;;…”, know that the absence of characters between the semi-colons causes the crash. Add a letter, or remove a semi-colon, and that should fix the issue.
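Checking the first line of a .slk file for this pattern is easy to script. The function below is a hypothetical helper, not part of any existing tool, and it only looks for the exact prefix described above.

```python
def slk_has_empty_id_field(first_line: str) -> bool:
    """Return True when the first line starts with ID followed by two semi-colons."""
    return first_line.lstrip().startswith("ID;;")


print(slk_has_empty_id_field("ID;;"))   # True: nothing between the semi-colons
print(slk_has_empty_id_field("ID;P"))   # False
```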
This new version of format-bytes.py (a tool to decompose structured binary data with format strings) brings a couple of new features.
Format strings can now be stored in libraries: you can store often used format strings (option -f) in text files and refer to them for using with format-bytes.py. A library file has the name of the program (format-bytes) and extension .library. Library files can be placed in the same directory as the program, and/or the current directory.
A library file is a text file. Each format string has a name and takes one line: name=formatstring.
This defines format string eqn. It can be retrieved with option -f name=eqn.
This format string can be followed by annotations (use a space character to separate the format string and the annotations):
Example:
eqn=<HIHIIIIIBBBBBBBBBB40sIIBB*:XXXXXXXXXXXXXXXXXXsXXXX 1: size of EQNOLEFILEHDR 9: Start MTEF header 14: Full size record 15: Line record 16: Font record 19: Shellcode (fontname)
A line in a library file that starts with # is a comment and is ignored.
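To make the library format concrete, here is a minimal parser for it. It is a sketch based only on the rules described above (name=formatstring, optional annotations after a space, # for comments) and is not the parsing code of format-bytes.py.

```python
def parse_library(text: str) -> dict:
    """Map each format string name to (formatstring, annotations)."""
    formats = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                                  # blank line or comment
        name, separator, rest = line.partition("=")
        if not separator:
            continue                                  # not a name=formatstring line
        formatstring, _, annotations = rest.partition(" ")
        formats[name] = (formatstring, annotations)
    return formats


library = "# Equation Editor\neqn=<HIHIIIIIBBBBBBBBBB40sIIBB*:XXXXXXXXXXXXXXXXXXsXXXX 1: size of EQNOLEFILEHDR 9: Start MTEF header\n"
print(parse_library(library)["eqn"])
```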
Format strings inside a library can be used with option -f. For example, to use format string eqn1, you use option -f name=eqn1. You prefix the format string name with “name=”, like in this example:
Option -s can also take value r now, to select the remainder: -s r. Like this:
The FILETIME format has been added. To use it explicitly, use representation format T.
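For context, a FILETIME is a 64-bit integer counting 100-nanosecond intervals since January 1, 1601 (UTC). The conversion can be sketched like this; it assumes a little-endian value and is not the code used inside format-bytes.py.

```python
import struct
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a FILETIME integer to a timezone-aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def parse_filetime(raw: bytes) -> datetime:
    """Convert 8 little-endian bytes holding a FILETIME."""
    (filetime,) = struct.unpack("<Q", raw)
    return filetime_to_datetime(filetime)


print(filetime_to_datetime(116444736000000000))  # 1970-01-01 00:00:00+00:00
```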
And finally, with option -F (Find), you can search for values inside a binary file. For the moment, only integers can be searched. Start the option value with #i# followed by the decimal number to search for.
Recently, I combined my tools to achieve the same without a 32-bit disassembler: I extract the strings directly from the binary shellcode.
What I’m looking for are sequences of instructions like this: mov dword [ebp - 0x10], 0x61626364. In 32-bit code, that’s C7 45 followed by one byte (offset operand) and 4 bytes (value operand).
Or: C7 45 F0 64 63 62 61 (the offset -0x10 is stored as the signed byte F0, and the value is stored in little-endian order). I can write a regular expression for this instruction, and use my tool re-search.py to extract it from the binary shellcode. I want at least 2 consecutive mov … instructions: {2,}.
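To illustrate how such an instruction carries 4 characters of a string, here is a small parser for this one encoding. The function name is illustrative. The offset operand is a signed byte, so an offset of -0x10 is stored as F0, and the value operand is little-endian, so its bytes already appear in string order.

```python
import struct


def parse_mov_dword_ebp(instruction: bytes):
    """Parse 'mov dword [ebp + offset], value' encoded as C7 45 <offset> <value>."""
    if len(instruction) != 7 or instruction[:2] != b"\xC7\x45":
        raise ValueError("not a C7 45 mov instruction")
    (offset,) = struct.unpack("<b", instruction[2:3])   # signed 8-bit offset
    (value,) = struct.unpack("<I", instruction[3:7])    # little-endian 32-bit value
    return offset, value, instruction[3:7]


offset, value, characters = parse_mov_dword_ebp(bytes.fromhex("C745F064636261"))
print(offset, hex(value), characters)  # -16 0x61626364 b'dcba'
```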
I’m using option -f because I want to process a binary file (re-search.py expects text files by default).
And I’m using option -x to produce hexadecimal output (to simplify further processing).
I want to get rid of the bytes for the instruction and the offset operand. I do this with sed:
I could convert this back to text with my tool hex-to-bin.py:
But that’s not ideal, because now all characters are merged into a single line.
My tool python-per-line.py gives a better result by processing this hexadecimal input line per line:
Note that I also use function repr to escape unprintable characters like 00.
This output provides a good overview of all API functions called by this shellcode.
If you take a close look, you’ll notice that the last strings are incomplete: that’s because they are missing one or two characters, and these are put on the stack with another mov instruction for single or double bytes. I can adapt my regular expression to take these instructions into account:
This is the complete command:
re-search.py -x -f "(?:\xC7\x45.....){2,}(?:(?:\xC6\x45..)|(?:\x66\xC7\x45...))?" shellcode.bin.vir | sed "s/66c745..//g" | sed "s/c[67]45..//g" | python-per-line.py -e "import binascii" "repr(binascii.a2b_hex(line))"
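For readers who prefer a single script, the same extraction can be approximated in pure Python. This is a sketch, not a replacement for the tools above: it uses the same regular expression, but removes the instruction bytes by slicing at fixed offsets instead of using sed on hexadecimal text. The flag re.DOTALL is an assumption made here so that an operand byte 0x0A is also matched.

```python
import re

STACK_STRING = re.compile(
    rb"((?:\xC7\x45.{5}){2,})((?:\xC6\x45..)|(?:\x66\xC7\x45...))?",
    re.DOTALL,
)


def extract_stack_strings(shellcode: bytes) -> list:
    """Rebuild strings written to the stack by consecutive mov instructions."""
    strings = []
    for match in STACK_STRING.finditer(shellcode):
        dwords, tail = match.group(1), match.group(2)
        # Each dword mov is 7 bytes: C7 45, offset operand, 4 value bytes.
        value = b"".join(dwords[i + 3:i + 7] for i in range(0, len(dwords), 7))
        if tail is not None:
            # C6 45 <offset> <1 byte> or 66 C7 45 <offset> <2 bytes>.
            value += tail[3:] if tail[0] == 0xC6 else tail[4:]
        strings.append(value)
    return strings


sample = (b"\x90" + b"\xC7\x45\xF0Load" + b"\xC7\x45\xF4Libr"
          + b"\xC7\x45\xF8aryA" + b"\xC6\x45\xFC\x00" + b"\x90")
for string in extract_stack_strings(sample):
    print(repr(string))
```

The sample bytes are made up for the example; real shellcode may use other registers or instruction forms that this expression does not cover.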
Since it’s open-source, I decided to recompile it with a statically linked C runtime, making it independent of the installed runtime(s). I used Visual Studio 2017 and let it do the default upgrade of the Visual Studio 2012 solution (default implies Windows XP is no longer supported). The only change I made was option /MT to link the runtime into the DLL.
To load the extension, type command “.load” with the full path to the DLL.
Or you can copy the DLL into a folder of the “extension dll search path”. You can view this search path with command “.chain” or “.extpath”:
Then you can just type “.load msec” to load the extension. If you use folders like x86\winext and x64\winext, you can copy the respective x86 and x64 versions without having to rename the DLL.
You can also load the extension and execute the command with one line (!msec.exploitable), like this:
One downside of statically linking the C runtime, is that I will have to recompile the DLLs if the C runtime gets patched to fix a vulnerability.
You can download the recompiled plugins here: MSECWinDbgExtensions.zip (https)
MD5: 090D9E4BE43B7272AA54673C366695E3
SHA256: 39AB11FDF9F80608235CE26833F57A850DD2C36C513EB92C97E28714BA0076FA
I was following Microsoft’s advice to install WinDbg as a post mortem debugger, but didn’t get the expected results.
It turns out that WinDbg x64 version will register itself as the post mortem debugger for 64-bit and 32-bit processes, and not just for 64-bit processes:
Of course, WinDbg x86 version will register itself only for 32-bit processes:
So to make sure that WinDbg x64 version will debug only 64-bit processes and WinDbg x86 version will debug 32-bit processes, run the post mortem registration commands in this order: