Didier Stevens

Tuesday 21 April 2020

Handling Diacritics

Filed under: My Software — Didier Stevens @ 0:00

In many languages, letters (basic glyphs) can have accents (diacritics).

Take the common French given name André. It is written with a letter e with an acute accent.

A colleague had to create a list of email addresses from a list of names (given name + surname). Some of the names had letters with accents: these accents had to be removed to keep the basic letter, in order to form a list of email addresses. For example, “andré” had to be converted to “andre”.

I found the Python module Unicode, and told my colleague he could use that module together with my python-per-line.py to generate his list. It turned out I had to make a change to my python-per-line.py tool first, so that it would handle Unicode input properly.

It works as follows. Take this Unicode text file:

Using unidecode method unidecode with python-per-line.py is done like this:

Remark that “é” has been converted to “e”.

Here is a list of names:

And here is the command to convert this list to email addresses:

c:\python37\python python-per-line.py –encoding utf-16 -e “import unidecode” “‘.’.join(unidecode.unidecode(line).lower().split(‘ ‘))+’@target.tld'” unicode-names.txt

Remark that personal names might be more complex than the simple case of “given name + surname”, and that the Python expression might have to be adapted accordingly.

python-per-line_V0_0_7.zip (https)
MD5: 1353108BE499E07745A409568940977F
SHA256: 0086B3780C768717072AC705A0FFEFFA5DD74565B36D4795813BF89E10F88240

Blog at WordPress.com.