Didier Stevens

Tuesday 1 September 2015

nsrl.py: Using the Reference Data Set of the National Software Reference Library

Filed under: Forensics,My Software — Didier Stevens @ 0:00

When I scan executables on a Windows machine looking for malware or suspicious files, I often use the Reference Data Set of the National Software Reference Library to filter out known benign files.

nsrl.py is the program I wrote to do this. It can read the Reference Data Set directly from the ZIP file provided by the NSRL; there is no need to unzip it.


Usage: nsrl.py [options] filemd5 [NSRL-file]
NSRL tool

--version             show program's version number and exit
-h, --help            show this help message and exit
separator to use (default is ; )
-H HASH, --hash=HASH  NSRL hash to use, options: SHA-1, MD5, CRC32 (default
-f, --foundonly       only report found hashes
-n, --notfoundonly    only report missing hashes
-a, --allfinds        report all matching hashes, not just first one
-q, --quiet           do not produce console output
-o OUTPUT, --output=OUTPUT
                      output to file
-m, --man             Print manual


nsrl.py looks up a list of hashes in the NSRL database and reports the
results as a CSV file.

The program takes as input a list of hashes (a text file). By default,
the hash used for lookup in the NSRL database is MD5. Use option -H to
select SHA-1 or CRC32 instead. The list of hashes is read into memory
and deduplicated, so a hash that appears more than once in the list is
matched only once. The NSRL database is then read and compared with
the list of hashes; for each match, a line is added to the CSV report.
If a hash has more than one entry in the NSRL database, only the first
occurrence is reported, unless option -a is used to report all
matching entries of the same hash. The first part of the CSV report
contains all matching hashes, and the second part all non-matching
hashes (hashes that were not found in the NSRL database). Use option
-f to report only matching hashes, and option -n to report only
non-matching hashes.
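
The matching logic described above can be sketched as follows. This is a minimal illustration, not the actual nsrl.py source: the function name match_hashes and its parameters are hypothetical, and it assumes the NSRL text file is comma-separated with a header row that names the columns "SHA-1", "MD5" and "CRC32".

```python
import csv
import io


def match_hashes(hash_lines, nsrl_lines, hash_column="MD5", all_finds=False):
    """Return (found, not_found) for a list of hashes.

    found maps each matched hash to a list of NSRL rows (dicts);
    not_found is a sorted list of hashes absent from the database.
    All names here are illustrative, not taken from nsrl.py.
    """
    # Only the list of hashes is held in memory; a set deduplicates it.
    wanted = {line.strip().upper() for line in hash_lines if line.strip()}
    found = {}
    # The NSRL database is streamed row by row, never loaded as a whole.
    for row in csv.DictReader(nsrl_lines):
        key = row[hash_column].upper()
        if key not in wanted:
            continue
        if key not in found:
            found[key] = [row]          # first occurrence
        elif all_finds:
            found[key].append(row)      # extra occurrences, like option -a
    not_found = sorted(wanted - set(found))
    return found, not_found


if __name__ == "__main__":
    sample = io.StringIO(
        '"SHA-1","MD5","CRC32","FileName"\n'
        '"AAAA","1111","C1","a.exe"\n'
        '"BBBB","1111","C2","b.exe"\n'
    )
    found, missing = match_hashes(["1111", "1111", "2222"], sample)
    print(found["1111"][0]["FileName"], missing)
```

Passing all_finds=True keeps every database row for a hash, which mirrors what option -a is described as doing.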

The CSV report is written to the console and to a CSV file with the
same name as the list of hashes, but with a timestamp appended. To
prevent output to the console, use option -q. To choose the output
filename, use option -o. The separator used in the CSV file is ;. This
can be changed with option -s.
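
As a rough sketch of the output step: the helper names report_filename and write_report are hypothetical, and the timestamp format is an assumption, since the manual does not say which format nsrl.py appends.

```python
import csv
import os
import sys
import time


def report_filename(hash_list_path, now=None):
    """Derive a report name from the hash list name plus a timestamp.

    The YYYYMMDD-HHMMSS format is an assumption for illustration only.
    """
    root, _ = os.path.splitext(hash_list_path)
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    return f"{root}-{stamp}.csv"


def write_report(rows, out, separator=";"):
    """Write rows to an open text stream using the chosen separator."""
    writer = csv.writer(out, delimiter=separator, lineterminator="\n")
    writer.writerows(rows)


if __name__ == "__main__":
    rows = [["1111", "a.exe"], ["2222", "b.exe"]]
    quiet = False  # plays the role of option -q
    if not quiet:
        write_report(rows, sys.stdout)
    with open(report_filename("hashes.txt"), "w", newline="") as handle:
        write_report(rows, handle)
```

Using the csv module rather than joining strings by hand means a field that happens to contain the separator is quoted automatically.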

The second argument given to nsrl.py is the NSRL database. This can be
the NSRL database text file (NSRLFile.txt), the gzip-compressed NSRL
database text file, or the ZIP file containing the NSRL database text
file. I use the “reduced set” or minimal hashset (each hash appears
only once) found on http://www.nsrl.nist.gov/Downloads.htm. The second
argument can be omitted if a gzip-compressed NSRL database text file
named NSRLFile.txt.gz is stored in the same directory as nsrl.py.
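
Reading the database from any of these three forms without unpacking it first can be done with the Python standard library. The sketch below is one possible approach, not the code from nsrl.py: the function name iter_nsrl_lines is hypothetical, the file type is guessed from the extension, and decoding as UTF-8 with replacement of invalid bytes is an assumption.

```python
import gzip
import io
import zipfile


def iter_nsrl_lines(path, member="NSRLFile.txt"):
    """Yield text lines from a plain, gzip-compressed or zipped NSRL file."""
    lower = path.lower()
    if lower.endswith(".zip"):
        with zipfile.ZipFile(path) as archive:
            # Look for the database text file inside the archive.
            names = [n for n in archive.namelist() if n.endswith(member)]
            if not names:
                raise ValueError(f"{member} not found in {path}")
            # The member is decompressed as a stream, not extracted to disk.
            with archive.open(names[0]) as raw:
                yield from io.TextIOWrapper(
                    raw, encoding="utf-8", errors="replace"
                )
    elif lower.endswith(".gz"):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            yield from handle
    else:
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            yield from handle
```

Because the function is a generator, its result can be passed directly to csv.reader or csv.DictReader, and only one line of the database is held in memory at a time.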

nsrl_V0_0_1.zip (https)
MD5: 5063EEEF7345C65D012F65463754A97C
SHA256: ADD3E82EDABA7F956CDEBE93135096963B0B11BB48473EEC2C45FC21CFB32BAA


  1. I also just completed something similar. But it’s not necessary to load the whole NSRL data into memory. I do a binary tree search in the NSRL file. This has to be sorted though (which has to be done once). So I find about 1000 hashes in the NSRL file per second. So you can pipe the result of e.g. an md5deep run through the program and print out only the unknown etc. It even runs on a Raspberry Pi with several hundred hashes a second. It also accepts hashes on a TCP port. No need for a complete Nsrllookup server or to keep the whole database in memory (several GB!). The code is a bit ugly because it’s one of my first Python programs, but I can release it if there is interest.

    Comment by Bruno — Wednesday 2 September 2015 @ 12:51

  2. @Bruno I don’t read the NSRL data into memory. As I wrote, the program reads your list of hashes into memory, not the NSRL database.

    And if you want speed, I suggest storing the NSRL data in an RDBMS.

    Comment by Didier Stevens — Wednesday 2 September 2015 @ 12:55

  3. Be nice if this hashed the on the target computer an then compared them all against the database.

    Comment by Humphrey — Wednesday 2 September 2015 @ 16:42

  4. *hashed the files on the computer and*

    Comment by Humphrey — Wednesday 2 September 2015 @ 16:43

  5. @Humphrey That is the subject of an upcoming blogpost (tip: I use my tool AnalyzePESig). This tool is a pre-requisite.

    Comment by Didier Stevens — Wednesday 2 September 2015 @ 17:31

  6. […] Stevens shared a Python script that compares a list of hashes against the database of the National Software […]

    Pingback by nsrl.py : déterminer si un fichier est innofensif | .:[ d4 n3wS ]:. — Friday 4 September 2015 @ 9:17

  7. […] small update to my nsrl.py program: the CSV output now includes the […]

    Pingback by Update: nsrl.py Version 0.0.2 | Didier Stevens — Saturday 21 November 2015 @ 0:01
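
Comment 2 above suggests storing the NSRL data in an RDBMS when lookup speed matters. A minimal sketch of that idea using SQLite from the Python standard library follows. It is not part of nsrl.py: the table layout and the names build_index and lookup_md5 are hypothetical, and it assumes a comma-separated NSRL file with "SHA-1", "MD5" and "FileName" columns in its header.

```python
import csv
import io
import sqlite3


def build_index(nsrl_lines, db_path=":memory:"):
    """Load NSRL rows into an SQLite table indexed on the MD5 column."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nsrl (md5 TEXT, sha1 TEXT, filename TEXT)"
    )
    # A generator keeps memory use flat while the file is imported.
    rows = (
        (r["MD5"].upper(), r["SHA-1"].upper(), r["FileName"])
        for r in csv.DictReader(nsrl_lines)
    )
    conn.executemany("INSERT INTO nsrl VALUES (?, ?, ?)", rows)
    # The index is what turns each lookup into a fast search.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_nsrl_md5 ON nsrl (md5)")
    conn.commit()
    return conn


def lookup_md5(conn, md5):
    """Return the first file name recorded for an MD5, or None."""
    cur = conn.execute(
        "SELECT filename FROM nsrl WHERE md5 = ? LIMIT 1", (md5.upper(),)
    )
    row = cur.fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    sample = io.StringIO(
        '"SHA-1","MD5","CRC32","FileName"\n"AAAA","1111","C1","a.exe"\n'
    )
    conn = build_index(sample)
    print(lookup_md5(conn, "1111"), lookup_md5(conn, "2222"))
```

The import is a one-time cost; after that, each hash is looked up through the index instead of by scanning the whole text file.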
