Fastest way to strip punctuation from a unicode string in Python

Question

I am trying to efficiently strip punctuation from a unicode string. With a regular string, using mystring.translate(None, string.punctuation) is clearly the fastest approach. However, this code breaks on a unicode string in Python 2.7. As the comments to this answer explain, the translate method can still be used, but it must be implemented with a dictionary. When I use this implementation though, I find that translate's performance is dramatically reduced. Here is my timing code (copied primarily from this answer):

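# -*- coding: utf-8 -*-
# (encoding declaration needed in Python 2: the test strings contain non-ASCII characters)

import re, string, timeit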
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = "For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."
su = u"For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."


exclude = set(string.punctuation)
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(None, string.punctuation)

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))

def test_trans_unicode(su):
    return su.translate(tbl)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

print "sets (unicode)      :",timeit.Timer('f(su)', 'from __main__ import su,test_set as f').timeit(1000000)
print "regex (unicode)     :",timeit.Timer('f(su)', 'from __main__ import su,test_re as f').timeit(1000000)
print "translate (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_trans_unicode as f').timeit(1000000)
print "replace (unicode)   :",timeit.Timer('f(su)', 'from __main__ import su,test_repl as f').timeit(1000000)

As my results show, the unicode implementation of translate performs horribly:

sets      : 38.323941946
regex     : 6.7729549408
translate : 1.27428412437
replace   : 5.54967689514

sets (unicode)      : 43.6268708706
regex (unicode)     : 7.32343912125
translate (unicode) : 54.0041439533
replace (unicode)   : 17.4450061321

My question is whether there is a faster way to implement translate for unicode (or any other method) that would outperform regex.

Answer 1

The current test script is flawed, because it does not compare like with like.

For a fairer comparison, all the functions must be run with the same set of punctuation characters (i.e. either all ascii, or all unicode).

When that is done, the regex and replace methods fare much worse with the full set of unicode punctuation characters.
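To make the difference between the two punctuation sets concrete, here is a rough sketch of how each could be built. It is written for Python 3 as an assumption (the timings in this answer come from Python 2, where `unichr` and `xrange` would be used instead), and the names `ascii_punc` and `unicode_punc` are only illustrative. Note that the two sets are not nested: some characters in `string.punctuation`, such as `$`, are classified by Unicode as symbols rather than punctuation.

```python
import string
import sys
import unicodedata

# ASCII-only punctuation, as provided by the standard library.
ascii_punc = set(string.punctuation)

# Every code point whose Unicode general category starts with 'P'.
# The exact size of this set depends on the Unicode version of the interpreter.
unicode_punc = {
    chr(i)
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
}

print(len(ascii_punc), len(unicode_punc))
```

The regex character class and the replace loop both grow with the size of the set they are given, which is consistent with those two methods slowing down the most in the full unicode run.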

For full unicode, it looks like the "set" method is the best. However, if you only want to remove the ascii punctuation characters from unicode strings, it may be best to encode, translate, and decode (depending on the length of the input string).
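For the ASCII-only case, the encode/translate/decode idea might look roughly like this in Python 3. That is an assumption on my part: the timings in this answer are from Python 2, and this variant has not been re-timed. `strip_ascii_punct` is just an illustrative name.

```python
import string

# Bytes to delete: the ASCII punctuation characters, encoded once up front.
ASCII_PUNCT_BYTES = string.punctuation.encode('ascii')

def strip_ascii_punct(text):
    """Remove ASCII punctuation from a str via a UTF-8 bytes round trip."""
    raw = text.encode('utf-8')
    # bytes.translate(None, delete) drops every byte listed in `delete`
    # and leaves all other bytes unchanged.
    raw = raw.translate(None, ASCII_PUNCT_BYTES)
    return raw.decode('utf-8')

print(strip_ascii_punct("Obi Wan’s cantina: a hive, you know."))
```

The round trip is safe because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so deleting ASCII bytes cannot damage a non-ASCII character. Non-ASCII punctuation such as the curly apostrophe is left in place, which is exactly the limitation of this approach.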

The "replace" method can also be substantially improved by doing a containment test before attempting replacements (depending on the precise make-up of the string).
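As a minimal sketch of that containment test (Python 3 assumed here, and the function name and default argument are hypothetical), the loop only calls replace for characters that actually occur in the string:

```python
import string

def strip_with_containment_check(text, punctuation=string.punctuation):
    """Remove each punctuation character, skipping those not present."""
    for ch in punctuation:
        # `in` is a cheap scan; replace() would build a new string each time,
        # even when there is nothing to replace.
        if ch in text:
            text = text.replace(ch, '')
    return text

print(strip_with_containment_check("But, you know, one you still kinda want."))
```

Whether this helps depends on the input: if most of the punctuation characters are absent from the string, most replace calls are skipped, but if nearly all of them are present the extra test is pure overhead.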

Here are some sample results from a re-hash of the test script:

$ python2 test.py
running ascii punctuation test...
using byte strings...

set: 0.862006902695
re: 0.17484498024
trans: 0.0207080841064
enc_trans: 0.0206489562988
repl: 0.157525062561
in_repl: 0.213351011276

$ python2 test.py a
running ascii punctuation test...
using unicode strings...

set: 0.927773952484
re: 0.18892288208
trans: 1.58275294304
enc_trans: 0.0794939994812
repl: 0.413739919662
in_repl: 0.249747991562

$ python2 test.py u
running unicode punctuation test...
using unicode strings...

set: 0.978360176086
re: 7.97941994667
trans: 1.72471117973
enc_trans: 0.0784001350403
repl: 7.05612301826
in_repl: 3.66821289062

And here's the re-hashed script:

# -*- coding: utf-8 -*-

import re, string, timeit
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = """For me, Reddit brings to mind Obi Wan’s enduring description of the Mos
Eisley cantina: a wretched hive of scum and villainy. But, you know, one you
still kinda want to hang out in occasionally. The thing is, though, Reddit
isn’t some obscure dive bar in a remote corner of the universe—it’s a huge
watering hole at the very center of it. The site had some 400 million unique
visitors in 2012. They can’t all be Greedos. So maybe my problem is just that
I’ve never been able to find the places where the decent people hang out."""

su = u"""For me, Reddit brings to mind Obi Wan’s enduring description of the
Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one
you still kinda want to hang out in occasionally. The thing is, though,
Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a
huge watering hole at the very center of it. The site had some 400 million
unique visitors in 2012. They can’t all be Greedos. So maybe my problem is
just that I’ve never been able to find the places where the decent people
hang out."""

def test_trans(s):
    return s.translate(tbl)

def test_enc_trans(s):
    s = s.encode('utf-8').translate(None, string.punctuation)
    return s.decode('utf-8')

def test_set(s): # with list comprehension fix
    return ''.join([ch for ch in s if ch not in exclude])

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_repl(s):  # From S.Lott's solution
    for c in punc:
        s = s.replace(c, "")
    return s

def test_in_repl(s):  # From S.Lott's solution, with fix
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

txt = 'su'
ptn = u'[%s]'

if 'u' in sys.argv[1:]:
    print 'running unicode punctuation test...'
    print 'using unicode strings...'
    punc = u''
    tbl = {}
    for i in xrange(sys.maxunicode):
        char = unichr(i)
        if unicodedata.category(char).startswith('P'):
            tbl[i] = None
            punc += char
else:
    print 'running ascii punctuation test...'
    punc = string.punctuation
    if 'a' in sys.argv[1:]:
        print 'using unicode strings...'
        punc = punc.decode()
        tbl = {ord(ch):None for ch in punc}
    else:
        print 'using byte strings...'
        txt = 's'
        ptn = '[%s]'
        def test_trans(s):
            return s.translate(None, punc)
        test_enc_trans = test_trans

exclude = set(punc)
regex = re.compile(ptn % re.escape(punc))

def time_func(func, n=10000):
    timer = timeit.Timer(
        'func(%s)' % txt,
        'from __main__ import %s, test_%s  as func' % (txt, func))
    print '%s: %s' % (func, timer.timeit(n))

print
time_func('set')
time_func('re')
time_func('trans')
time_func('enc_trans')
time_func('repl')
time_func('in_repl')


Solution 1
6 2013-12-12 03:30:09
