Didier Stevens

Monday 22 February 2021

re-search.py And Custom Validations

Filed under: My Software — Didier Stevens @ 0:00

My tool re-search.py uses regular expressions to search through files. You can use regular expressions from a small built-in library, or provide your own.

And these regular expressions can be augmented with extra conditions, like validation with a custom Python function.

I’m going to illustrate this here with a regular expression to match credit card numbers. Credit card numbers have a check digit (calculated with the Luhn algorithm) and I’m going to augment the regular expression to validate the check digit.

I’m using the following regular expression to match credit card numbers (I’m limiting myself to credit card numbers of 16 digits): \b(\d{4}( ?)\d{4}\2\d{4}\2\d{4})\b

This regular expression consists of 4 expressions that each match 4 digits: “\d{4}”. The blocks of 4 digits can be separated by a space character: “ ?”.

I’m putting this in a capture group “( ?)” so that I can refer back to this matched group with backreference \2 (it’s the second capture group, because the complete credit card number is also put in a capture group, i.e. the first capture group).

I’m using a backreference to match the first optional space character because I want to match the next 2 separating space characters if and only if a first space character was matched. So I want to match 1111222233334444 and 1111 2222 3333 4444, but not 11112222 3333 4444, for example. Either all 4 groups are separated, or none are.

Finally, I put this expression in a capture group, and enclose it with a boundary check “\b”. This is to avoid matching credit card numbers that are immediately preceded or followed by letters or digits.
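To see the backreference and the boundary check at work, the expression can be tried out with Python's re module. This is a minimal sketch that is independent of re-search.py; the helper name find_ccn_candidates and the last sample string (with a leading letter) are made up for this illustration.

```python
import re

# The regular expression from this post: group 1 is the complete number,
# group 2 is the optional separator that \2 refers back to.
CCN_REGEX = re.compile(r'\b(\d{4}( ?)\d{4}\2\d{4}\2\d{4})\b')

def find_ccn_candidates(text):
    """Return every substring of text matched by the expression."""
    return [match.group(1) for match in CCN_REGEX.finditer(text)]

samples = [
    '1111222233334444',     # no separators at all: matches
    '1111 2222 3333 4444',  # a separator between every block: matches
    '11112222 3333 4444',   # mixed: no match
    'x1111222233334444',    # preceded by a letter: no match
]
for sample in samples:
    print(repr(sample), find_ccn_candidates(sample))
```

The third sample fails because the first separator matched an empty string, so \2 must also be empty for the remaining blocks.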

So I can use this regular expression with re-search.py on a test file:

You can see that the first 2 test credit card numbers are identical, except for the last digit: the check digit. So at most one of these 2 can be a valid credit card number.

This can be checked with the Luhn algorithm.

Here is a small Python script to validate this Luhn check digit:

# 2020/02/06
# https://stackoverflow.com/questions/21079439/implementation-of-luhn-formula

import string

def luhn_checksum(card_number):
    def digits_of(n):
        return [int(d) for d in str(n)]
    digits = digits_of(card_number)
    odd_digits = digits[-1::-2]
    even_digits = digits[-2::-2]
    checksum = 0
    checksum += sum(odd_digits)
    for d in even_digits:
        checksum += sum(digits_of(d*2))
    return checksum % 10

def is_luhn_valid(card_number):
    return luhn_checksum(card_number) == 0

def CCNValidate(ccn):
    return is_luhn_valid(''.join(digit for digit in ccn if digit in string.digits))

Python function luhn_checksum calculates the Luhn checksum for a string of digits (check digit included), and Python function is_luhn_valid returns True when this checksum is 0, that is, when the check digit is correct.

To use this last function with the regular expression I created, I need another Python function: CCNValidate. This function receives the string matched by the regular expression, extracts the digits and checks the Luhn check digit.
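As a worked example, here is a compact sketch of the same functions applied to 4111 1111 1111 1111, a widely used test number that I am assuming here for illustration (it does not come from the test file in this post). The helper luhn_check_digit is hypothetical and not part of the script above: it shows how the check digit itself can be derived from the first 15 digits.

```python
import string

def luhn_checksum(card_number):
    # Same calculation as in the script above, written compactly.
    digits = [int(d) for d in str(card_number)]
    odd_digits = digits[-1::-2]   # check digit and every second digit to its left
    even_digits = digits[-2::-2]  # the digits that get doubled
    checksum = sum(odd_digits)
    for d in even_digits:
        checksum += sum(int(c) for c in str(d * 2))
    return checksum % 10

def luhn_check_digit(payload):
    # Hypothetical helper: append a placeholder 0, then pick the digit
    # that brings the checksum to 0.
    return (10 - luhn_checksum(str(payload) + '0')) % 10

def CCNValidate(ccn):
    # Keep only the digits (drops the separating spaces), then validate.
    return luhn_checksum(''.join(c for c in ccn if c in string.digits)) == 0

print(luhn_checksum('4111111111111111'))   # 0: the check digit is correct
print(luhn_checksum('4111111111111116'))   # 5: the check digit is wrong
print(luhn_check_digit('411111111111111')) # 1
print(CCNValidate('4111 1111 1111 1111'))  # True
```

For the valid number, the undoubled digits sum to 8 and the doubled digits contribute 7 × 2 + 8 = 22, giving 30, which is a multiple of 10.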

To let my tool re-search.py call this function CCNValidate when a credit card number is matched by the regular expression, I precede the regular expression with a comment, like this:

(?#extra=P:CCNValidate)\b(\d{4}( ?)\d{4}\2\d{4}\2\d{4})\b

(?#…) is a comment in the regular expression syntax. It is ignored by the parser (i.e. not used for matching). … is the comment itself, which can be anything.
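That the comment has no effect on matching can be checked with Python's re module. A small sketch, with a shortened expression and a sample text that are made up for this illustration:

```python
import re

# The same expression, once without and once with a (?#...) comment.
plain = re.compile(r'\b\d{4}\b')
commented = re.compile(r'(?#extra=P:CCNValidate)\b\d{4}\b')

text = 'pin 1234, year 2021'
print(plain.findall(text))      # ['1234', '2021']
print(commented.findall(text))  # ['1234', '2021']
```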

re-search.py interprets this comment: when the comment starts with “extra=”, re-search.py is dealing with an augmented regular expression. P indicates that a Python function has to be called when the regular expression matches, and CCNValidate is the name of that Python function.
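The idea can be sketched in a few lines of Python. This is not the actual code of re-search.py, only an illustration of the mechanism under my own assumptions: the names EXTRA_COMMENT, search_with_validation and ends_in_zero are hypothetical, the comment is assumed to sit at the very start of the expression, and the validation function is assumed to receive the complete match.

```python
import re

# Assumed shape of the comment: (?#extra=P:<function name>) at the start.
EXTRA_COMMENT = re.compile(r'^\(\?#extra=P:(\w+)\)')

def search_with_validation(augmented_regex, text, functions):
    """Return the matches in text that pass the named validation function.

    functions maps names to callables; it stands in for whatever the
    real tool loads from the script file.
    """
    validator = None
    header = EXTRA_COMMENT.match(augmented_regex)
    if header:
        validator = functions[header.group(1)]
    results = []
    # The comment can stay in the expression: the regex engine ignores it.
    for match in re.finditer(augmented_regex, text):
        matched = match.group(0)
        if validator is None or validator(matched):
            results.append(matched)
    return results

def ends_in_zero(matched):
    # Toy validation function, only for this demonstration.
    return matched.endswith('0')

print(search_with_validation(r'(?#extra=P:ends_in_zero)\b\d{4}\b',
                             '1230 1234 5670',
                             {'ends_in_zero': ends_in_zero}))
# ['1230', '5670']
```

A match is reported only when the regular expression matches and the validation function returns True, which is what happens with CCNValidate and the Luhn check.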

All this combined gives me the following command:

You can see that the first credit card number that was matched in the first example no longer matches: that’s because 6 is not the correct Luhn check digit for this credit card number.

Besides providing re-search.py with this augmented regular expression, I also need to provide the Python script containing the validation functions: I do this with option --script CCNValidate.py


