Python regex to convert non-ascii characters in a string to closest ascii equivalents

I'm looking for a simple Python function that takes a string and returns a similar one with all non-ASCII characters converted to their closest ASCII equivalents. For example, diacritics and the like should be dropped. I imagine there must be a fairly canonical way to do this, and there are plenty of related Stack Overflow questions, but I can't find a simple answer, so it seemed worth a separate question.

Example input/output:

"Étienne" -> "Etienne"

Reading this question made me go looking for something better.

https://pypi.python.org/pypi/Unidecode/0.04.1

Does exactly what you ask for.
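For reference, a minimal usage sketch of that package. It assumes Unidecode has been installed (for example with pip install Unidecode); the import is guarded only so the snippet still runs where the package is missing:

```python
try:
    # Third-party package, not part of the standard library.
    from unidecode import unidecode
except ImportError:
    unidecode = None  # Unidecode is not installed in this environment

if unidecode is not None:
    # unidecode() maps each character to an ASCII approximation.
    print(unidecode("Étienne"))  # Etienne
```

The result is always a plain ASCII str, so it can be compared or stored without further encoding steps.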

In Python 3 and using the regex implementation at PyPI:

http://pypi.python.org/pypi/regex

Starting with the string:

>>> s = "Étienne"

Normalise to NFKD and then remove the diacritics:

>>> import unicodedata
>>> import regex
>>> regex.sub(r"\p{Mn}", "", unicodedata.normalize("NFKD", s))
'Etienne'
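
If installing the third-party regex module is not an option, the same idea can be sketched with the standard library alone. The helper name strip_marks is made up for illustration; it drops every character whose Unicode general category is Mn (nonspacing mark), which is the class that \p{Mn} matches:

```python
import unicodedata

def strip_marks(text):
    """Decompose with NFKD, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn"
    )

print(strip_marks("Étienne"))  # Etienne
```

Characters that have no decomposition are left untouched by this version, so the result is not guaranteed to be pure ASCII.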

Searching for 'iconv TRANSLIT python', I found http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/ which looks like it might be what you need. The comments there suggest some other approaches that use the standard library instead.

There's also http://web.archive.org/web/20070807224749/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python/ which uses NFKD to get the base characters where possible.
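
A rough sketch of that NFKD approach, written from scratch rather than copied from the linked post (the function name ascii_fold is only illustrative). It decomposes the string and then lets the ASCII codec discard whatever is left over:

```python
import unicodedata

def ascii_fold(text):
    """Keep the ASCII base characters that NFKD exposes; drop the rest."""
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" silently discards every remaining non-ASCII code point,
    # including the combining accents split off by the normalization.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("Étienne"))  # Etienne
print(ascii_fold("Straße"))   # Strae  (ß has no decomposition and is lost)
```

The second call shows the "where possible" caveat: a character with no decomposition simply disappears instead of being approximated.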

Read the answers to some of the duplicate questions. The NFKD gimmick works only as an accent stripper. It doesn't handle ligatures such as "æ" or the many other Latin-based characters, such as "ø" and "ß", that can't be (or aren't) decomposed. For those, a prepared translation table is necessary (and much faster).
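
To make the translation-table point concrete, here is a small sketch using str.maketrans and str.translate. The table EXTRA and the function fold_with_table are hypothetical and deliberately tiny; a real table would need to cover far more characters. NFKD is kept only as a fallback for ordinary accented letters:

```python
import unicodedata

# Hypothetical, hand-picked entries for characters NFKD leaves unchanged.
EXTRA = str.maketrans({
    "æ": "ae", "Æ": "AE",
    "œ": "oe", "Œ": "OE",
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
    "ß": "ss",
})

def fold_with_table(text):
    """Apply the table first, then strip accents via NFKD as a fallback."""
    text = text.translate(EXTRA)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(unicodedata.normalize("NFKD", "ø") == "ø")  # True: nothing to decompose
print(fold_with_table("Søren"))   # Soren
print(fold_with_table("Straße"))  # Strasse
print(fold_with_table("Łódź"))    # Lodz
```

Anything that is neither in the table nor decomposable is still dropped by the final encode step, so the table has to be extended for each script or language the input may contain.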
