Didier Stevens

Monday 9 November 2015

byte-stats.py

Filed under: My Software — Didier Stevens @ 0:00

I have a new tool that calculates byte statistics for files, like entropy. I used it recently to help me recover images from a ransomware infection, as described in these SANS ISC Diary entries:

 

Usage: byte-stats.py [options] [files ...]
Calculate byte statistics

files:
wildcards are supported
@file: run command on each file listed in the text file specified

Source code put in the public domain by Didier Stevens, no Copyright
Use at your own risk
https://DidierStevens.com

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -m, --man             Print manual
  -d, --descending      Sort descending
  -k, --keys            Sort on keys in stead of counts
  -b BUCKET, --bucket=BUCKET
                        Size of bucket (default is 10240 bytes)
  -l, --list            Print list of bucket property
  -p PROPERTY, --property=PROPERTY
                        Property to list: encwph
  -a, --all             Print all byte stats
  -s, --sequence        Detect simple sequences
  -f FILTER, --filter=FILTER
                        Minimum length of sequence for displaying (default 0)

Manual:

byte-stats is a tool to calculate byte statistics of the content of files. It
helps to determine the type or content of a file.

Let's start with some examples.
all.bin is a 256-byte large file, containing all possible byte values. The
bytes are ordered: the first byte is 0x00, the second one is 0x01, the third
one is 0x02, ... and the last one is 0xFF.

$byte-stats.py all.bin

Byte ASCII Count     Pct
0x00           1   0.39%
0x01           1   0.39%
0x02           1   0.39%
0x03           1   0.39%
0x04           1   0.39%
...
0xfb           1   0.39%
0xfc           1   0.39%
0xfd           1   0.39%
0xfe           1   0.39%
0xff           1   0.39%

Size: 256

                   File(s)
Entropy:           8.000000
NULL bytes:               1   0.39%
Control bytes:           27  10.55%
Whitespace bytes:         6   2.34%
Printable bytes:         94  36.72%
High bytes:             128  50.00%

First, byte-stats.py displays a histogram of the byte values found in the
file(s). The first column is the byte value in hex (Byte), the second column is
its ASCII character if it is printable (ASCII), the third column tells us how
many times the byte value appears (Count) and the last column is the percentage
(Pct).
This histogram is sorted by Count (ascending). To change the order, use option
-d (descending); to sort by byte value, use option -k (key).
By default, the first 5 and last 5 entries of the histogram are displayed. To
display all values, use option -a (all).
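
As an illustration of the histogram described above, here is a minimal Python sketch. It is not the tool's actual code: the function name histogram, its parameters, and the tie-breaking on byte value are assumptions made for this example, and how -d and -k combine in byte-stats.py itself is not confirmed here.

```python
from collections import Counter

def histogram(data: bytes, descending: bool = False, by_key: bool = False):
    """Return (byte value, count, percentage) rows for the bytes in data.

    Hypothetical helper: sorts by count (ascending) by default, or by
    byte value when by_key is True. Ties on count are broken by byte
    value, which is an assumption of this sketch.
    """
    counts = Counter(data)
    if by_key:
        rows = sorted(counts.items(), key=lambda kv: kv[0], reverse=descending)
    else:
        rows = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=descending)
    size = len(data)
    return [(value, count, 100.0 * count / size) for value, count in rows]

if __name__ == "__main__":
    for value, count, pct in histogram(b"hello world"):
        print(f"0x{value:02x} {count:9d} {pct:6.2f}%")
```

Only byte values that actually occur in the data get a row, so the first and last 5 entries can be taken by slicing the returned list.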

After the histogram, the size of the file(s) is displayed.

Finally, the following statistics for the file(s) are displayed:
* Entropy (between 0.0 and 8.0).
* Number and percentage of NULL bytes (0x00).
* Number and percentage of Control bytes (0x01 through 0x1F, excluding
whitespace bytes and including 0x7F).
* Number and percentage of Whitespace bytes (0x09 through 0x0D and 0x20).
* Number and percentage of Printable bytes (0x21 through 0x7E).
* Number and percentage of High bytes (0x80 through 0xFF).
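
The statistics listed above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the code of byte-stats.py: the names entropy, classify and byte_classes are invented for this example, and entropy is taken to be Shannon entropy in bits per byte, which is consistent with the 0.0 to 8.0 range.

```python
import math
from collections import Counter

# Whitespace bytes as defined in the list above: 0x09 through 0x0D, and 0x20.
WHITESPACE = set(range(0x09, 0x0E)) | {0x20}

def entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    size = len(data)
    if size == 0:
        return 0.0
    return -sum((n / size) * math.log2(n / size) for n in Counter(data).values())

def classify(value: int) -> str:
    """Map a byte value to one of the five classes from the list above."""
    if value == 0x00:
        return "NULL"
    if value in WHITESPACE:
        return "Whitespace"
    if value < 0x20 or value == 0x7F:
        return "Control"
    if value < 0x7F:
        return "Printable"
    return "High"

def byte_classes(data: bytes) -> dict:
    """Count how many bytes of data fall into each class."""
    counts = dict.fromkeys(["NULL", "Control", "Whitespace", "Printable", "High"], 0)
    for value, n in Counter(data).items():
        counts[classify(value)] += n
    return counts

if __name__ == "__main__":
    data = bytes(range(256))  # same content as all.bin
    print(entropy(data))
    print(byte_classes(data))
```

For a 256-byte input containing every byte value once, this yields an entropy of 8.0 and counts of 1, 27, 6, 94 and 128, the same figures as in the all.bin output above.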

byte-stats.py also splits the file into equally sized parts (called buckets)
and performs the same calculations for each bucket. The default bucket size is
10KB (10240 bytes), but it can be changed with option -b (bucket). If the
file is smaller than the bucket size, no bucket calculations are performed. If
the file size is not an exact multiple of the bucket size, no calculations
are done for the last bucket (because it is incomplete).
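
The bucket logic described above can be sketched as follows. This is a hedged illustration, not the tool's source: bucket_entropies and min_max_buckets are hypothetical names, entropy is assumed to be Shannon entropy in bits per byte, and only entropy is computed per bucket to keep the example short.

```python
import math
from collections import Counter

def entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (assumed definition)."""
    size = len(data)
    if size == 0:
        return 0.0
    return -sum((n / size) * math.log2(n / size) for n in Counter(data).values())

def bucket_entropies(data: bytes, bucket_size: int = 10240):
    """Return (position, entropy) for every complete bucket.

    Integer division drops an incomplete last bucket, and yields no
    buckets at all when the data is smaller than one bucket.
    """
    count = len(data) // bucket_size
    return [
        (i * bucket_size, entropy(data[i * bucket_size:(i + 1) * bucket_size]))
        for i in range(count)
    ]

def min_max_buckets(data: bytes, bucket_size: int = 10240):
    """Return the lowest- and highest-entropy buckets, or None if there are none."""
    buckets = bucket_entropies(data, bucket_size)
    if not buckets:
        return None
    lowest = min(buckets, key=lambda b: b[1])
    highest = max(buckets, key=lambda b: b[1])
    return lowest, highest

if __name__ == "__main__":
    sample = bytes(range(256)) * 80 + b"\x00" * 10240 + b"tail"
    for position, value in bucket_entropies(sample):
        print(f"0x{position:08x} {value:f}")
```

Returning the position together with the value makes it straightforward to report where the minimum and maximum buckets start.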

Here is an example with buckets (file random.bin just contains random bytes):

$byte-stats.py random.bin

Byte ASCII Count     Pct
0xce         242   0.32%
0x14         248   0.33%
0x52 R       251   0.34%
0xba         251   0.34%
0x3e >       256   0.34%
...
0x2e .       332   0.44%
0x45 E       336   0.45%
0xc9         336   0.45%
0x1b         338   0.45%
0x75 u       344   0.46%

Size: 74752  Bucket size: 10240  Bucket count: 7

                   File(s)           Minimum buckets   Maximum buckets
Entropy:           7.997180          7.981543          7.984125
                   Position:         0x0000f000        0x00005000
NULL bytes:             303   0.41%        34   0.33%        44   0.43%
Control bytes:         7888  10.55%      1046  10.21%      1117  10.91%
Whitespace bytes:      1726   2.31%       220   2.15%       254   2.48%
Printable bytes:      27278  36.49%      3680  35.94%      3812  37.23%
High bytes:           37557  50.24%      5096  49.77%      5211  50.89%

Besides the file size (74752), the size of the bucket (10240) and the number of
buckets (7) are displayed.
Next to the entropy and byte counters for the complete file, the same values
are calculated for each bucket. The minimum values found across the buckets are
displayed (Minimum buckets), as are the maximum values (Maximum buckets).
Position gives the start, in hexadecimal, of the bucket with minimum entropy
and of the bucket with maximum entropy.
A significant difference between the overall statistics and the bucket
statistics can indicate a file that is not uniform in its content,
as in this picture "encrypted" by ransomware:

$byte-stats.py picture.jpg.ransom

Byte ASCII Count     Pct
0x44 D      1172   0.13%
0x16        1310   0.15%
0x22 "      1371   0.16%
0xc2        1421   0.16%
0x17        1437   0.16%
...
0x7a z      7958   0.91%
0x82        8006   0.91%
0x7e ~      8571   0.98%
0x80       22232   2.53%
0x00       23873   2.72%

Size: 877456  Bucket size: 10240  Bucket count: 85

                   File(s)           Minimum buckets   Maximum buckets
Entropy:           7.815519          5.156678          7.981628
                   Position:         0x00019000        0x00005000
NULL bytes:           23873   2.72%         8   0.08%      1643  16.04%
Control bytes:        92243  10.51%        98   0.96%      1275  12.45%
Whitespace bytes:     16241   1.85%         1   0.01%       263   2.57%
Printable bytes:     303975  34.64%      2476  24.18%      5219  50.97%
High bytes:          441124  50.27%      3728  36.41%      6772  66.13%

The entropy for the whole file is 7.815519 (typical of encrypted or compressed
data), but there is one part of the file (bucket) with an entropy of only
5.156678. This part is not encrypted or compressed.
To locate this part, option -l (list) can be used to list the entropy values
for each bucket:

$byte-stats.py -l picture.jpg.ransom

0x00000000 7.978380
0x00002800 7.979475
0x00005000 7.981628
0x00007800 7.267890
0x0000a000 6.579047
0x0000c800 6.798210
0x0000f000 6.733402
0x00011800 6.496882
0x00014000 5.743983
0x00016800 5.488550
0x00019000 5.156678
0x0001b800 5.330629
0x0001e000 6.057448
0x00020800 6.425884
0x00023000 6.880007
0x00025800 6.856647
...

The bucket starting at position 0x00019000 has the lowest entropy.

A list for the other properties (NULL bytes, ...) can be produced by using
option -l together with option -p (property). For example options "-l -p n"
will produce a list of the number of NULL bytes for each bucket.

Option -s (sequence) instructs byte-stats to search for simple byte sequences.
A simple byte sequence is a sequence of bytes where the difference (unsigned)
between 2 consecutive bytes is a constant.
Example:

$byte-stats.py -s picture.jpg.ransom

Byte ASCII Count     Pct
0x44 D      1172   0.13%
0x16        1310   0.15%
0x22 "      1371   0.16%
0xc2        1421   0.16%
0x17        1437   0.16%
...
0x7a z      7958   0.91%
0x82        8006   0.91%
0x7e ~      8571   0.98%
0x80       22232   2.53%
0x00       23873   2.72%

Size: 877456  Bucket size: 10240  Bucket count: 85

                   File(s)           Minimum buckets   Maximum buckets
Entropy:           7.815519          5.156678          7.981628
                   Position:         0x00019000        0x00005000
NULL bytes:           23873   2.72%         8   0.08%      1643  16.04%
Control bytes:        92243  10.51%        98   0.96%      1275  12.45%
Whitespace bytes:     16241   1.85%         1   0.01%       263   2.57%
Printable bytes:     303975  34.64%      2476  24.18%      5219  50.97%
High bytes:          441124  50.27%      3728  36.41%      6772  66.13%

Position    Length Diff Bytes
0x00013984:    246  128 0x8000800080008000800080008000800080008000...
0x00013c01:    206  128 0x0080008000800080008000800080008000800080...
0x0001b186:    205  128 0x8000800080008000800080008000800080008000...
0x0001b406:    205  128 0x8000800080008000800080008000800080008000...
0x0001b906:    204  128 0x8000800080008000800080008000800080008000...
0x0001bb86:    204  128 0x8000800080008000800080008000800080008000...
0x0001be06:    200  128 0x8000800080008000800080008000800080008000...
0x0001c086:    200  128 0x8000800080008000800080008000800080008000...
0x0001c306:    200  128 0x8000800080008000800080008000800080008000...
0x0001c586:    196  128 0x8000800080008000800080008000800080008000...

Position is the start of the detected sequence, Length is the number of bytes
in the sequence, Diff is the difference (unsigned) between 2 consecutive bytes
and Bytes displays the hex values of the start of the sequence.
By default, the 10 longest sequences are displayed. All sequences (minimum 3
bytes long) can be displayed with option -a. To sort the sequences by position
use option -k (key). To filter the sequences by length, use option -f.
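
To make the definition of a simple sequence concrete, here is a minimal Python sketch. It is an assumption-laden illustration rather than the tool's implementation: find_sequences and longest are hypothetical names, the difference is computed modulo 256, a difference of 0 (a repeated byte) counts as a sequence, and two adjacent sequences may share their boundary byte. byte-stats.py may handle these cases differently.

```python
def find_sequences(data: bytes, min_length: int = 3):
    """Return (position, length, diff) for runs of bytes whose unsigned
    difference between consecutive bytes is constant."""
    sequences = []
    if len(data) < 2:
        return sequences
    start = 0
    diff = (data[1] - data[0]) % 256
    for i in range(1, len(data) - 1):
        next_diff = (data[i + 1] - data[i]) % 256
        if next_diff != diff:
            # The current run ends at byte i (inclusive).
            length = i - start + 1
            if length >= min_length:
                sequences.append((start, length, diff))
            # The next run starts at byte i, the shared boundary byte.
            start, diff = i, next_diff
    length = len(data) - start
    if length >= min_length:
        sequences.append((start, length, diff))
    return sequences

def longest(sequences, limit: int = 10):
    """Return the longest sequences first, keeping at most limit entries."""
    return sorted(sequences, key=lambda s: s[1], reverse=True)[:limit]

if __name__ == "__main__":
    data = bytes(range(256)) * 16  # assumed to resemble not-random.bin
    for position, length, diff in longest(find_sequences(data)):
        print(f"0x{position:08x}: {length:6d} {diff:4d}")
```

Because the difference is taken modulo 256, the step from 0xFF back to 0x00 also has a difference of 1, so 4096 bytes of repeated 0x00 to 0xFF counting form a single sequence.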

Sequence detection is useful as an extra check when the entropy and byte
counters indicate the file is random:

$byte-stats.py -s not-random.bin

Byte ASCII Count     Pct
0x00          16   0.39%
0x01          16   0.39%
0x02          16   0.39%
0x03          16   0.39%
0x04          16   0.39%
...
0xfb          16   0.39%
0xfc          16   0.39%
0xfd          16   0.39%
0xfe          16   0.39%
0xff          16   0.39%

Size: 4096

                   File(s)
Entropy:           8.000000
NULL bytes:              16   0.39%
Control bytes:          432  10.55%
Whitespace bytes:        96   2.34%
Printable bytes:       1504  36.72%
High bytes:            2048  50.00%

Position    Length Diff Bytes
0x00000000:   4096    1 0x000102030405060708090a0b0c0d0e0f10111213...

byte-stats_V0_0_3.zip (https)
MD5: 4287A94EC56E0BF5A936C2A16DA7F2B4
SHA256: 310B15865B332FF62F2C70CE441D322491DB79BC5D1C8D8BBC9A7245005491B5
