“ASCIIfying” a Unicode String
Say we want to keep our file names or URLs in plain ASCII, but the strings from which they are generated contain diacritical marks such as umlauts and tildes. How can this be done?
First, we import the unicodedata library. The only thing we need from it is the decomposition() function. It breaks a composed character into its parts, given as a string of hexadecimal code points:
    >>> import unicodedata
    >>> unicodedata.decomposition(u'õ')
    '006F 0303'
(For a non-decomposable character such as u'a', it returns just the empty string ''.)
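As a quick check of what decomposition() returns, here is a small sketch. It assumes Python 3, where every str is already Unicode and the u'' prefix is unnecessary; the sample characters and the name decompositions are chosen only for illustration.

```python
import unicodedata

# Map a few sample characters to their decomposition strings.
samples = ['õ', 'ü', 'a', '½']
decompositions = {char: unicodedata.decomposition(char) for char in samples}

for char, decomp in decompositions.items():
    print(repr(char), '->', repr(decomp))
# 'õ' -> '006F 0303'
# 'ü' -> '0075 0308'
# 'a' -> ''
# '½' -> '<fraction> 0031 2044 0032'
```

Note the last line: compatibility decompositions start with a tag in angle brackets rather than a code point, so the first field is not always a hexadecimal number.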
Let us look at these parts. The first one is the ASCII character without embellishments:
    >>> u'\u006F'
    u'o'
Another way to get the character corresponding to a character code is the unichr() function, which is in fact used in the ASCIIfying function below. E.g.
    >>> unichr(0x006F)
    u'o'
But the second part is echoed only as an escape sequence:
    >>> u'\u0303'
    u'\u0303'
It is a tilde, of course, as we see by asking for its name
    >>> unicodedata.name(u'\u0303')
    'COMBINING TILDE'
or, even better, by printing it to the screen:
    >>> print u'\u0303'
    ~
But I digress. We see that the plain ASCII character is the first one among the decomposed parts.
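Therefore, to “ASCIIfy” a character, we only need to decompose it if possible, split the resulting string, take its zeroth part and convert it from a base-16 string to an integer with int(). That integer is the code point of the plain character, which unichr() turns back into a character.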
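The steps just described can be sketched for a single character as follows. This is a Python 3 illustration, so chr() stands in for unichr(); the helper name base_char is hypothetical, and the check for a leading '<' is an added assumption that leaves tagged compatibility decompositions (such as '<fraction> 0031 2044 0032' for '½') untouched instead of failing in int().

```python
from unicodedata import decomposition

def base_char(char):
    """Return the first code point of char's decomposition, or char itself."""
    decomp = decomposition(char)       # e.g. '006F 0303' for 'õ'
    if not decomp or decomp.startswith('<'):
        return char                    # nothing usable to decompose
    first = decomp.split()[0]          # '006F'
    return chr(int(first, 16))         # int('006F', 16) == 111, i.e. 'o'

print(base_char('õ'))  # o
print(base_char('a'))  # a
```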
One could elaborate on it, of course, by omitting non-Latin characters like π, or converting them to their Latin equivalents.
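One possible shape for such an elaboration is sketched below in Python 3. Everything in it is an assumption made for illustration: the function name to_ascii_only, the replacement table and its entries are hypothetical, and the function is meant to be applied after the diacritics have already been stripped.

```python
# Hypothetical replacement table; extend it to suit your own data.
LATIN_EQUIVALENTS = {'π': 'pi', 'ß': 'ss'}

def to_ascii_only(text, replacements=LATIN_EQUIVALENTS):
    """Replace known characters, then drop anything still outside ASCII."""
    out = []
    for char in text:
        char = replacements.get(char, char)   # use the Latin equivalent if known
        if all(ord(c) < 128 for c in char):   # keep only plain ASCII
            out.append(char)
    return ''.join(out)

print(to_ascii_only('2πr'))        # 2pir
print(to_ascii_only('2πr', {}))    # 2r  (no table: π is simply omitted)
```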
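Putting it all together, we have

    from unicodedata import decomposition

    def asciify(string):
        '''"ASCIIfy" a Unicode string
        by stripping all umlauts, tildes, etc.'''
        temp = u''
        for char in string:
            decomp = decomposition(char)
            # Decompose only if there is a decomposition and it is not a
            # tagged one such as '<fraction> 0031 2044 0032', whose first
            # field is not a hexadecimal number.
            if decomp and not decomp.startswith('<'):
                temp += unichr(int(decomp.split()[0], 16))
            else:
                temp += char
        return temp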
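The function above is written for Python 2 (unichr() and u'' literals). For readers on Python 3, a rough equivalent might look like the sketch below. The name asciify_py3 is hypothetical, chr() replaces unichr(), and the check for a leading '<' is an assumption added here so that tagged compatibility decompositions are passed through unchanged.

```python
from unicodedata import decomposition

def asciify_py3(string):
    """Python 3 sketch: strip umlauts, tildes, etc. from a string."""
    result = []
    for char in string:
        decomp = decomposition(char)
        if decomp and not decomp.startswith('<'):
            # Keep only the first code point, i.e. the base character.
            result.append(chr(int(decomp.split()[0], 16)))
        else:
            result.append(char)
    return ''.join(result)

print(asciify_py3('Ñandú über São Paulo'))  # Nandu uber Sao Paulo
```

A single pass removes only one layer of marks: a character carrying two diacritics, such as 'ế', comes out as 'ê', so it may be necessary to apply the function repeatedly until the result stops changing.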