Python regex: converting non-ASCII characters in a string to their closest ASCII equivalents

Question

I'm seeking a simple Python function that takes a string and returns a similar one, but with all non-ASCII characters converted to their closest ASCII equivalent. For example, diacritics and the like should be dropped. I imagine there must be a fairly canonical way to do this, and there are plenty of related Stack Overflow questions, but I'm not finding a simple answer, so it seemed worth a separate question.

Example input/output:

"Étienne" -> "Etienne"

Answer 1

Reading this question made me go looking for something better.

https://pypi.python.org/pypi/Unidecode/0.04.1

Does exactly what you ask for.
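As a rough illustration of how that package is typically used (a sketch, assuming the third-party Unidecode package is installed; the guarded import is only there so the snippet still runs without it):

```python
# Unidecode is a third-party package, not part of the standard library.
try:
    from unidecode import unidecode
except ImportError:
    unidecode = None  # package not installed; nothing to demonstrate

if unidecode is not None:
    # Transliterates the accented capital E to a plain "E".
    print(unidecode("Étienne"))  # prints: Etienne
```

Unidecode works from prepared transliteration tables, so it also covers characters that Unicode normalisation alone leaves untouched.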

Answer 2

In Python 3, using the regex implementation from PyPI:

http://pypi.python.org/pypi/regex

Starting with the string:

>>> s = "Étienne"

Normalise to NFKD and then remove the diacritics:

>>> import unicodedata
>>> import regex
>>> regex.sub(r"\p{Mn}", "", unicodedata.normalize("NFKD", s))
'Etienne'
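If installing the third-party regex module is not an option, roughly the same filtering can be sketched with the standard library alone. The helper name strip_marks is made up for this illustration; it drops every character whose Unicode category is Mn, which is what \p{Mn} matches.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFKD, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )


print(strip_marks("Étienne"))  # prints: Etienne
```

Like the regex version, this only removes combining marks; characters with no decomposition pass through unchanged.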

Answer 3

Searching for 'iconv TRANSLIT python', I found http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/ which looks like it might be what you need. The comments there have some other ideas that use the standard library instead.

There's also http://web.archive.org/web/20070807224749/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python/ which uses NFKD to get the base characters where possible.
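A common standard-library form of the NFKD idea is sketched below. This is not necessarily the code from either linked page; the function name to_ascii_lossy is hypothetical. It decomposes the string, then encodes to ASCII while ignoring anything that does not fit.

```python
import unicodedata


def to_ascii_lossy(text):
    """NFKD-decompose, then silently drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_lossy("Étienne"))  # prints: Etienne
```

Note that the "ignore" error handler discards characters with no ASCII base form instead of approximating them, so the result can be shorter than the input.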

Answer 4

Read the answers to some of the duplicate questions. The NFKD gimmick works only as an accent stripper. It doesn't handle ligatures and lots of other Latin-based characters that can't be (or aren't) decomposed. For those, a prepared translation table is necessary (and much faster).
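A minimal sketch of the translation-table idea follows. The table holds only a handful of illustrative entries and is nowhere near complete, and the names _TABLE and asciify are made up for this example. Characters such as Æ, Œ, ß, Ø and Ł have no Unicode decomposition, so they are mapped explicitly before the usual accent stripping.

```python
import unicodedata

# Illustrative entries only; a real table would need many more mappings.
_TABLE = str.maketrans({
    "\u00c6": "AE", "\u00e6": "ae",  # Æ æ
    "\u0152": "OE", "\u0153": "oe",  # Œ œ
    "\u00df": "ss",                  # ß
    "\u00d8": "O", "\u00f8": "o",    # Ø ø
    "\u0141": "L", "\u0142": "l",    # Ł ł
})


def asciify(text):
    """Apply the table first, then strip combining marks via NFKD."""
    text = text.translate(_TABLE)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(asciify("\u0141\u00f3d\u017a"))  # Łódź, prints: Lodz
```

Anything that is neither in the table nor decomposable still passes through unchanged, so the output is not guaranteed to be pure ASCII.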

4 answers

Answer 1
4 2013-11-04 14:32:02

Answer 2
2 2011-04-02 01:24:03

Answer 3
1 2010-10-04 10:11:09

Answer 4
1 2011-04-02 01:30:34
