Kristjan Kannike

“ASCIIfying” a Unicode String

Say we want to keep our file names or URLs in plain ASCII, but the strings from which they are generated have diacritical marks such as umlauts and tildes. How can this be done?

First, we need the unicodedata library. The only thing we need from it is the decomposition() function. It returns the decomposition of a composable character as a string of hexadecimal code points, like
>>> import unicodedata
>>> unicodedata.decomposition(u'õ')
'006F 0303'

(For a non-decomposable character such as u'a', it returns just ''.)

Let us look at these parts. The first,
>>> u'\u006F'
u'o'
is the ASCII character without embellishments.

Another way to get the character corresponding to a character code is the unichr() function, which is in fact used in the ASCIIfying function below. E.g.
>>> unichr(0x006F)
u'o'

But the second part is echoed only as an escape sequence:
>>> u'\u0303'
u'\u0303'

It is a tilde, of course, as we see by asking for its name
>>> unicodedata.name(u'\u0303')
'COMBINING TILDE'
or even better, printing it to the screen:
>>> print u'\u0303'
~

But I digress. We see that the plain ASCII character is the first one among the decomposed parts.

Therefore, to “ASCIIfy” a character, we only need to decompose it if possible, split the resulting string, take its zeroth part, convert it from base 16 to an integer with int(), and turn that integer back into a character with unichr().
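The steps just described can be traced for a single character. This is only a sketch, and it assumes Python 3, where chr() plays the role of unichr() and every string is already Unicode; the variable names are illustrative.

```python
from unicodedata import decomposition

char = '\u00f5'                    # õ, LATIN SMALL LETTER O WITH TILDE
decomp = decomposition(char)       # '006F 0303'
parts = decomp.split()             # ['006F', '0303']
code_point = int(parts[0], 16)     # 111, i.e. 0x6F
base = chr(code_point)             # 'o' (unichr() in Python 2)

print(base)
```

Each intermediate value is kept in its own variable here only to make the chain visible; in practice the steps collapse into one expression.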

One could elaborate on it, of course, by omitting non-Latin characters like π, or converting them to their Latin equivalents.
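One possible shape for such an elaboration is sketched below, again assuming Python 3. It is not the approach of this recipe: instead of decomposition() it uses unicodedata.normalize('NFKD', ...), which decomposes a character fully, and then keeps only the ASCII parts. The function name asciify_strict and its replacements parameter are hypothetical.

```python
import unicodedata

def asciify_strict(text, replacements=None):
    """Strip diacritics, then replace or drop whatever is still non-ASCII."""
    replacements = replacements or {}
    out = []
    for char in text:
        if char in replacements:
            # Caller-supplied Latin equivalent, e.g. {'\u03c0': 'pi'}
            out.append(replacements[char])
            continue
        # NFKD splits the character into base character and combining marks;
        # everything outside ASCII (marks, Greek letters, ...) is omitted.
        for part in unicodedata.normalize('NFKD', char):
            if ord(part) < 128:
                out.append(part)
    return ''.join(out)

print(asciify_strict('S\u00e3o \u03c0'))                      # π is omitted
print(asciify_strict('S\u00e3o \u03c0', {'\u03c0': 'pi'}))    # π becomes 'pi'
```

Whether to drop or to transliterate is a policy decision; the replacements table is left to the caller for that reason.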

Putting it all together, we have

from unicodedata import decomposition

def asciify(string):
    '''"ASCIIfy" a Unicode string
    by stripping all umlauts, tildes, etc.'''

    temp = u''
    for char in string:
        decomp = decomposition(char)
        # Compatibility mappings such as '<noBreak> 0020' start with a tag,
        # not a code point, so int() cannot parse them; leave those as is.
        if decomp and not decomp.startswith('<'):
            temp += unichr(int(decomp.split()[0], 16))
        else:
            temp += char
    return temp
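For readers on Python 3, where unichr() and the u'' prefix are no longer needed, the function might be ported roughly as follows. This is a sketch rather than a definitive version: it collects the characters in a list instead of concatenating strings, and it leaves characters with tagged compatibility mappings (first field such as <noBreak>) untouched, since int() cannot parse the tag. The sample string is made up for illustration.

```python
from unicodedata import decomposition

def asciify(string):
    """Keep the base character of every decomposable character."""
    result = []
    for char in string:
        decomp = decomposition(char)
        # Skip tagged compatibility mappings like '<noBreak> 0020'
        if decomp and not decomp.startswith('<'):
            result.append(chr(int(decomp.split()[0], 16)))
        else:
            result.append(char)
    return ''.join(result)

print(asciify('T\u00f5nu \u00dcmarik'))   # Tonu Umarik
```

Note that only one level of decomposition is applied, so a character whose first part is itself accented (for example one carrying two diacritics) would still come out non-ASCII.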