Python regex to convert non-ascii characters in a string to closest ascii equivalents

Question

I'm looking for a simple Python function that takes a string and returns a similar one with all non-ASCII characters converted to their closest ASCII equivalents. For example, diacritics and the like should be dropped. I imagine there must be a fairly canonical way to do this, and there are plenty of related Stack Overflow questions, but I haven't found a simple answer, so it seemed worth a separate question.

Example input/output:

"Étienne" -> "Etienne"

Answer 1

Reading this question made me go looking for something better, and I found Unidecode:

https://pypi.python.org/pypi/Unidecode/0.04.1

It does exactly what you ask for.
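
For anyone who wants to try it, here is a minimal sketch of how the package is typically used. It assumes Unidecode has been installed from PyPI; the to_ascii wrapper name is made up for illustration, and the import is guarded only so the snippet still loads where the package is missing.

```python
# Third-party package, assumed installed with: pip install Unidecode
try:
    from unidecode import unidecode
except ImportError:  # package not available in this environment
    unidecode = None


def to_ascii(text):
    """Transliterate text to ASCII via Unidecode (hypothetical wrapper name)."""
    if unidecode is None:
        raise RuntimeError("Unidecode is not installed")
    return unidecode(text)


if unidecode is not None:
    print(to_ascii("Étienne"))
```

Unidecode works from prepared per-character tables, so it also covers characters that have no Unicode decomposition.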

Answer 2

In Python 3, using the third-party regex module from PyPI (the standard library's re module does not support \p{...} property classes):

http://pypi.python.org/pypi/regex

Starting with the string:

>>> s = "Étienne"

Normalise to NFKD, which splits each accented character into a base character plus combining marks, and then remove the marks (Unicode category Mn):

>>> import unicodedata
>>> import regex
>>> regex.sub(r"\p{Mn}", "", unicodedata.normalize("NFKD", s))
'Etienne'
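
If installing the regex module is not an option, roughly the same thing can be sketched with the standard library alone. This is a variant and not the code above: it filters on unicodedata.category instead of a \p{Mn} pattern, and the strip_marks name is made up for illustration.

```python
import unicodedata


def strip_marks(text):
    """Drop nonspacing marks (category Mn) after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character except the combining marks split off by NFKD.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_marks("Étienne"))  # Etienne
```

Like the regex version, this only removes combining marks; any other non-ASCII characters are left in the string unchanged.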

Answer 3

Searching for 'iconv TRANSLIT python' turned up this post, which looks like it might be what you need: http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/ The comments there suggest some other approaches that use only the standard library.

There's also http://web.archive.org/web/20070807224749/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python/ which uses NFKD to get the base characters where possible.
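
A minimal sketch of the NFKD idea, assuming the common normalize-then-encode idiom (this has not been checked against the linked page, and the nfkd_ascii name is made up for illustration):

```python
import unicodedata


def nfkd_ascii(text):
    """Decompose with NFKD, then discard everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" silently drops combining marks and any other
    # character that has no ASCII form after decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(nfkd_ascii("Étienne"))  # Etienne
```

Note that characters with no decomposition are dropped entirely, not approximated, so this is lossy for anything beyond accented letters.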

Answer 4

Read the answers to some of the duplicate questions. The NFKD gimmick works only as an accent stripper. It doesn't handle ligatures such as Æ and Œ, or the many other Latin-based characters (ß, Ø, Ł and so on) that can't be (or aren't) decomposed. For those a prepared translation table is necessary (and much faster).
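
To make the translation-table idea concrete, here is a rough sketch using str.translate. The table below is a tiny hypothetical sample, not a complete mapping, and the to_ascii name is made up for illustration; NFKD is kept only as a fallback for plain accented letters.

```python
import unicodedata

# Hypothetical, deliberately small table; a real one would cover far more.
TRANSLATION_TABLE = str.maketrans({
    "Æ": "AE", "æ": "ae",
    "Œ": "OE", "œ": "oe",
    "ß": "ss",
    "Ø": "O", "ø": "o",
    "Ł": "L", "ł": "l",
})


def to_ascii(text):
    """Apply the table first, then strip what NFKD can decompose."""
    # Characters NFKD cannot decompose are handled by the table.
    text = text.translate(TRANSLATION_TABLE)
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII (e.g. combining marks) is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Straße"))         # Strasse
print(to_ascii("Œuvre de Łódź"))  # OEuvre de Lodz
```

The table lookup is a single pass over the string, which is where the speed advantage comes from; the choice of replacements (for example "ss" for ß) is a judgment call that depends on the target language.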

4 answers

solution1
4 2013-11-04 14:32:02

solution2
2 2011-04-02 01:24:03

solution3
1 2010-10-04 10:11:09

solution4
1 2011-04-02 01:30:34
