Some new options for my tool sortcanon.py to handle more inputs.
A bit of context: when one sorts a list of IPv4 addresses as text, the result is not in numerical order. Take this list:
Just sorting this gives this result:
The IPv4 address starting with 185 comes first because, by default, sorting is string-based and the digit 1 comes before the digit 3.
With sortcanon, one can provide a Python function that will be used to interpret the input and achieve the desired sorting. There are a couple of built-in functions, like ipv4. This is the result:
This time, the IPv4 address starting with 185 comes last, because it has the highest most significant byte.
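The difference between the two orderings can be reproduced in a few lines of Python. This is only a sketch using the standard ipaddress module, not sortcanon's own code, and the addresses are made up for illustration (they are not the list from this post):

```python
import ipaddress

# Made-up sample addresses, not the list used in the post.
addresses = ["38.0.0.7", "185.0.0.1", "9.0.0.2", "38.0.0.10"]

# Default sorting compares the lines character by character.
text_sorted = sorted(addresses)

# Sorting with a key function compares the addresses numerically.
numeric_sorted = sorted(addresses, key=ipaddress.IPv4Address)

print(text_sorted)
print(numeric_sorted)
```

With text sorting, 185.0.0.1 ends up first; with the key function, it ends up last.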
Recently, I had to sort some files with extra data, like IPv4 addresses with port numbers. Something like this list:
But this did not work:
This fails because the function that parses IPv4 addresses does not expect a port number.
I could create a custom function to handle this, but I pursued another solution. I added an option to select, with a regular expression, the part of the line that will be used for sorting. This is done with option -s (select). Like this:
Regular expression “^([^ ]+) ” selects all characters from the beginning of the line (^) until the first space character (excluded). This selection is stored in a capture group (), and the ipv4 sorting function takes this capture group as input, instead of the complete line.
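To illustrate the idea of sorting on a capture group, here is a minimal Python sketch. It is not taken from sortcanon: the function name ipv4_key, the fallback to the whole line when the expression does not match, and the sample lines are all assumptions made for this example.

```python
import ipaddress
import re

# Same regular expression as above: everything up to the first space.
SELECT = re.compile(r"^([^ ]+) ")

def ipv4_key(line):
    """Hypothetical key function: parse only the selected part of the line."""
    match = SELECT.search(line)
    # Assumption: fall back to the complete line when nothing is selected.
    selected = match.group(1) if match else line
    return ipaddress.IPv4Address(selected)

# Made-up sample lines: IPv4 address, a space, then a port number.
lines = ["185.0.0.1 443", "38.0.0.7 80", "9.0.0.2 8080"]
result = sorted(lines, key=ipv4_key)
print(result)
```

The complete lines are kept in the output; only the sort key is derived from the capture group.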
The list I selected as an example has some duplicate IPv4 addresses:
If I use option -u (unique), duplicate lines are removed:
But of course the lines with identical IPv4 address 53… remain, because the lines themselves are different (different port number).
This is the desired result, most of the time. But I had an exceptional case, where I had to drop duplicate IPv4 addresses, but still keep one port number. This can be done with option --selectoptions u:
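The difference between removing duplicate lines and removing duplicate selections can be sketched in Python as follows. This is an illustration only: the helper unique_by_selection is hypothetical, the sample lines are made up, and keeping the first line seen for each address is an assumption (the tool may keep a different one).

```python
import re

SELECT = re.compile(r"^([^ ]+) ")

def unique_by_selection(lines):
    """Hypothetical helper: keep one line per selected value."""
    seen = set()
    kept = []
    for line in lines:
        match = SELECT.search(line)
        key = match.group(1) if match else line
        # Assumption: the first line seen for a given key is the one kept.
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

# Made-up sample lines with a repeated address and a repeated line.
lines = ["10.0.0.1 80", "10.0.0.1 443", "9.0.0.2 22", "10.0.0.1 80"]

unique_lines = list(dict.fromkeys(lines))      # duplicates judged per line
unique_selected = unique_by_selection(lines)   # duplicates judged per address

print(unique_lines)
print(unique_selected)
```

Line-based uniqueness keeps both port numbers for the repeated address, while selection-based uniqueness keeps a single line for it.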
Some changes to the translate option: now it supports this format (like some of my other tools):
i=codec[:error],o=codec[:error]
i= is input and o= is output. If you don’t specify an error handling mode, strict will be used.
An example of the format is: i=utf16,o=latin:ignore. This will read binary data in utf16 strict mode, convert it to binary data in ANSI (latin), and ignore all utf16 characters that cannot be represented in latin.
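In Python terms, such a translation amounts to a decode followed by an encode. The sketch below parses the format and applies it; the function names parse_translate and translate are hypothetical and this is not the tool's actual code. It relies on utf16 and latin being standard Python codec aliases.

```python
def parse_translate(spec):
    """Hypothetical parser for i=codec[:error],o=codec[:error]."""
    settings = {}
    for part in spec.split(","):
        direction, _, value = part.partition("=")
        codec, _, errors = value.partition(":")
        # No error handling mode given: default to strict.
        settings[direction] = (codec, errors or "strict")
    return settings

def translate(data, spec):
    """Decode bytes with the input codec, then encode with the output codec."""
    settings = parse_translate(spec)
    in_codec, in_errors = settings["i"]
    out_codec, out_errors = settings["o"]
    return data.decode(in_codec, in_errors).encode(out_codec, out_errors)

# Made-up sample: the euro sign has no representation in latin.
sample = "caf\u00e9 \u20ac5".encode("utf16")
converted = translate(sample, "i=utf16,o=latin:ignore")
print(converted)
```

With o=latin (strict) instead of o=latin:ignore, the same sample would raise a UnicodeEncodeError because of the euro sign.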