
What's the fastest way to strip and replace a document of high unicode characters using Python?

I am looking to replace from a large document all high unicode characters, such as accented Es, left and right quotes, etc., with "normal" counterparts in the low range, such as a regular 'E', and straight quotes. I need to perform this on a very large document rather often. I see an example of this in what I think might be perl here: http://www.designmeme.com/mtplugins/lowdown.txt

Is there a fast way of doing this in Python without using s.replace(...).replace(...).replace(...)...? I've tried this on just a few characters to replace and the document stripping became really slow.

EDIT, my version of unutbu's code that doesn't seem to work:

# -*- coding: iso-8859-15 -*-
import unidecode
def ascii_map():
    data={}
    for num in range(256):
        h=num
        filename='x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.'+filename,
                             fromlist=True)
        except ImportError:
            pass
        else:
            for l,val in enumerate(mod.data):
                i=h<<8
                i+=l
                if i >= 0x80:
                    data[i]=unicode(val)
    return data

if __name__=='__main__':
    s = u'“fancy“fancy2'
    print(s.translate(ascii_map()))

# -*- encoding: utf-8 -*-
import unicodedata

def shoehorn_unicode_into_ascii(s):
    return unicodedata.normalize('NFKD', s).encode('ascii','ignore')

if __name__=='__main__':
    s = u"éèêàùçÇ"
    print(shoehorn_unicode_into_ascii(s))
    # eeeaucC

Note, as @Mark Tolonen kindly points out, the method above removes some characters like ß‘’“”. If the above code truncates characters that you wish translated, then you may have to use the string's translate method to manually fix these problems. Another option is to use unidecode (see J.F. Sebastian's answer).
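
A quick interactive check of that caveat (my addition, Python 2): none of ß‘’“” has a compatibility decomposition, so the encode('ascii', 'ignore') step simply drops them:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u"\xdf\u2018\u2019\u201c\u201d").encode('ascii', 'ignore')
''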

When you have a large unicode string, using its translate method will be much much faster than using the replace method.
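
If you want to verify that claim on your own data, a rough timing sketch along the following lines should do. (This block is my illustration, not part of the original answer; the sample text and counts are arbitrary, and it is written in Python 2 syntax to match the surrounding code.)

import timeit

# a handful of "fancy" characters mapped to plain ASCII equivalents
table = {0x2018: u"'", 0x2019: u"'", 0x201c: u'"', 0x201d: u'"', 0xe9: u"e"}
text = u"caf\xe9 \u2018quoted\u2019 \u201cfancy\u201d " * 100000  # a few MB of text

def with_replace(s):
    return (s.replace(u"\u2018", u"'").replace(u"\u2019", u"'")
             .replace(u"\u201c", u'"').replace(u"\u201d", u'"')
             .replace(u"\xe9", u"e"))

def with_translate(s):
    return s.translate(table)

print timeit.timeit(lambda: with_replace(text), number=10)
print timeit.timeit(lambda: with_translate(text), number=10)

The gap tends to grow with the number of characters you need to map, since every .replace() call makes another full copy of the string, while translate does a single pass.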

Edit: unidecode has a more complete mapping of unicode codepoints to ascii. However, unidecode.unidecode loops through the string character-by-character (in a Python loop), which is slower than using the translate method.

The following helper function uses unidecode's data files, and the translate method to attain better speed, especially for long strings.

In my tests on 1-6 MB text files, using ascii_map is about 4-6 times faster than unidecode.unidecode.

# -*- coding: utf-8 -*-
import unidecode

def ascii_map():
    # Build a dict mapping Unicode ordinals (>= 0x80) to ASCII replacement
    # strings by importing unidecode's per-block data modules, each of which
    # holds the replacements for one block of 256 codepoints.
    data={}
    for num in range(256):
        h=num                          # high byte: which 256-codepoint block
        filename='x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.'+filename,
                             fromlist=True)  # true-ish fromlist returns the submodule
        except ImportError:
            pass                       # no data module for this block
        else:
            for l,val in enumerate(mod.data):
                i=h<<8                 # block base ...
                i+=l                   # ... plus offset = full codepoint
                if i >= 0x80:          # leave plain ASCII alone
                    data[i]=unicode(val)
    return data

if __name__=='__main__':
    s = u"éèêàùçÇ"
    print(s.translate(ascii_map()))
    # eeeaucC

Edit2: Rhubarb, if # -*- encoding: utf-8 -*- is causing a SyntaxError, try # -*- encoding: cp1252 -*-. What encoding to declare depends on what encoding your text editor uses to save the file. Linux tends to use utf-8, and (it seems perhaps) Windows tends to use cp1252.

There is no such thing as a "high ascii character". The ASCII character set is limited to ordinals in range(128).

That aside, this is a FAQ. Here's one answer. In general, you should familiarise yourself with str.translate() and unicode.translate() -- very handy for multiple substitutions of single bytes/characters. Beware of answers that mention only the unicodedata.normalize() gimmick; that's just one part of the solution.

Update: The currently-accepted answer blows away characters that don't have a decomposition, as pointed out by Mark Tolonen. There seems to be a lack of knowledge of what unicode.translate() is capable of. It CAN translate one character into multiple characters. Here is the output from help(unicode.translate):

S.translate(table) -> unicode

Return a copy of the string S, where all characters have been mapped through the given translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted.

Here's an example:

>>> u"Gau\xdf".translate({0xdf: u"ss"})
u'Gauss'
>>>
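
The docstring also mentions that characters mapped to None are deleted. A tiny interactive illustration of that (the soft hyphen here is just an example I picked, not from the original answer):

>>> u"soft\xadhyphen".translate({0xad: None})
u'softhyphen'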

Here's a table of fix-ups from the solution that I pointed to:

CHAR_REPLACEMENT = {
    # latin-1 characters that don't have a unicode decomposition
    0xc6: u"AE", # LATIN CAPITAL LETTER AE
    0xd0: u"D",  # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
    0xdf: u"ss", # LATIN SMALL LETTER SHARP S
    0xe6: u"ae", # LATIN SMALL LETTER AE
    0xf0: u"d",  # LATIN SMALL LETTER ETH
    0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
    0xfe: u"th", # LATIN SMALL LETTER THORN
    }

This can be easily extended to cater for the fancy quotes and other non-latin-1 characters found in cp1252 and siblings.
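
As a sketch of what that extension might look like (my addition, not part of the original answer; the extra ordinals are illustrative), add the cp1252-style punctuation to the table and run the translate pass before the NFKD/ascii step so that the fix-ups survive:

# -*- coding: utf-8 -*-
import unicodedata

FIXUPS = dict(CHAR_REPLACEMENT)   # assumes the table from above is in scope
FIXUPS.update({
    0x2018: u"'", 0x2019: u"'",   # left/right single quotation mark
    0x201c: u'"', 0x201d: u'"',   # left/right double quotation mark
    0x2013: u"-", 0x2026: u"...", # en dash, horizontal ellipsis
})

def asciify(s):
    s = s.translate(FIXUPS)       # multi-character fix-ups first
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

print asciify(u"Gau\xdf said \u201c\xe9l\xe8ve\u201d")
# -> Gauss said "eleve"

Doing the translate pass first matters: once the NFKD/'ignore' step has run, characters like ß or the curly quotes are already gone and there is nothing left to fix.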

I believe that unicodedata doesn't work for fancy quotes. You could use Unidecode in this case:

import unidecode
print unidecode.unidecode(u"ß‘’“”")
# -> ss''""

If unicodedata.normalize() as suggested by ~unutbu doesn't do the trick, for example if you want more control over the mapping, you should look into str.translate() along with str.maketrans(), a utility to produce a map table. str.translate is both efficient and convenient for this type of translation.
In Python 2.x, and for unicode strings, one needs to use unicode.translate() rather than str.translate(), and a trick similar to the one shown in the code snippet below in lieu of maketrans(). (Thanks to John Machin for pointing this out!)

These methods are also available in Python 3.x; see for example the Python 3.1.2 documentation (for some reason I had made a mental note that this may have changed in Python 3.x). Of course under Python 3, all strings are unicode strings, but that's another issue.

#Python 3.1
>>> intab = 'àâçêèéïîôù'
>>> outtab = 'aaceeeiiou'
>>> tmap = str.maketrans(intab, outtab)
>>> s = "à la fête de l'été, où il fait bon danser, les Français font les drôles"
>>> s
"à la fête de l'été, où il fait bon danser, les Français font les drôles"
>>> s.translate(tmap)
"a la fete de l'ete, ou il fait bon danser, les Francais font les droles"
>>>


#Python 2.6
>>> intab = u'àâçêèéïîôù'
>>> outtab = u'aaceeeiiou'
>>> s = u"à la fête de l'été, où il fait bon danser, les Français font les drôles"
>>> #note the trick to replace maketrans() since for unicode strings the translation
>>> #     map expects integers (unicode ordinals) not characters.
>>> tmap = dict(zip(map(ord, intab), map(ord, outtab))) 
>>> s.translate(tmap)
u"a la fete de l'ete, ou il fait bon danser, les Francais font les droles"
>>>

Here's a solution that handles latin-1 characters (based on a 2003 usenet thread):

>>> accentstable = str.join("", map(chr, range(192))) + "AAAAAAACEEEEIIIIDNOOOOOxOUUUUYTsaaaaaaaceeeeiiiidnooooo/ouuuuyty"
>>> import string
>>> s = u"éèêàùçÇ"
>>> print string.translate(s.encode('latin1', 'ignore'), accentstable)
eeeaucC

Some of the mappings aren't perfect, e.g. Thorn maps to T rather than Th, but it does a tolerable job.
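
If the Th/ss-style multi-character mappings matter, one option (my sketch, combining this with the unicode.translate trick from the answers above, and continuing the interactive session so string and accentstable are already defined) is to handle those characters on the unicode string first and only then apply the byte-level table:

>>> MULTI = {0xde: u"Th", 0xfe: u"th", 0xdf: u"ss"}  # Thorn, thorn, sharp s
>>> s = u"\xdeor \xe9t\xe9"
>>> print string.translate(s.translate(MULTI).encode('latin1', 'ignore'), accentstable)
Thor ete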
