
String substitution performance in Python

I have a list of ~50,000 strings (titles) and a list of ~150 words to remove from those titles wherever they appear. My code so far is below. The final output should be the list of 50,000 strings, with all instances of the 150 words removed. I would like to know the most efficient (performance-wise) way of doing this. My code seems to be running, albeit very slowly...

import re

excludes = GetExcludes()
titles = GetTitles()
titles_alpha = []
titles_excl = []
for k in range(len(titles)):
    #remove all non-alphanumeric characters 
    s = re.sub('[^0-9a-zA-Z]+', ' ',titles[k])

    #remove extra white space
    s = re.sub(r'\s+', ' ', s).strip()

    #lowercase
    s = s.lower()

    titles_alpha.append(s)
    #remove any excluded words, building one cleaned string per title
    s = titles_alpha[k]
    for i in range(len(excludes)):
        s = s.replace(excludes[i], '')
    titles_excl.append(s)

print titles_excl

A lot of the performance overhead of regular expressions comes from compiling them. You should move the compilation of the regular expressions out of the loop.

This should give you a considerable improvement:

pattern1 = re.compile('[^0-9a-zA-Z]+')
pattern2 = re.compile(r'\s+')
for k in range(len(titles)):
    #remove all non-alphanumeric characters
    s = pattern1.sub(' ', titles[k])

    #remove extra white space
    s = pattern2.sub(' ', s).strip()
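
For completeness, here is a minimal sketch of the question's full normalization with the compilation hoisted out of the loop (normalize is a hypothetical helper name; the sample titles stand in for GetTitles()):

import re

# compile once, reuse for every title
NON_ALNUM = re.compile('[^0-9a-zA-Z]+')
WHITESPACE = re.compile(r'\s+')

def normalize(title):
    s = NON_ALNUM.sub(' ', title)       # non-alphanumerics -> spaces
    s = WHITESPACE.sub(' ', s).strip()  # collapse extra whitespace
    return s.lower()                    # lowercase

titles = ['Hello, World!', '  Foo -- Bar  ']  # stand-in for GetTitles()
titles_alpha = [normalize(t) for t in titles]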

Some tests with wordlist.txt from here:

import re
def noncompiled():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = re.sub('[^0-9a-zA-Z]+', ' ',titles[k])

        #remove extra white space
        s = re.sub(r'\s+', ' ', s).strip()

def compiled():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1 = re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = pattern1.sub('',titles[k])

        #remove extra white space
        s = pattern2.sub('', s)



In [2]: %timeit noncompiled()
1 loops, best of 3: 292 ms per loop

In [3]: %timeit compiled()
10 loops, best of 3: 176 ms per loop

To remove the "bad words" from your excludes list, you should, as @zsquare suggested, create a joined regex, which is most likely the fastest approach you can get.

def with_excludes():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1 = re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    excludes = ["shit","poo","ass","love","boo","ch"]
    excludes_regex = re.compile('|'.join(excludes))
    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = pattern1.sub('',titles[k])

        #remove extra white space
        s = pattern2.sub('', s)
        #remove bad words
        s = excludes_regex.sub('', s)

In [2]: %timeit with_excludes()
1 loops, best of 3: 251 ms per loop

You can take this approach one step further by compiling a single master regex:

def master():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    excludes = ["shit","poo","ass","love","boo","ch"]
    nonalpha = '[^0-9a-zA-Z]+'
    whitespace = r'\s+'
    badwords = '|'.join(excludes)
    master_regex = re.compile('|'.join([nonalpha, whitespace, badwords]))

    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = master_regex.sub('', titles[k])

In [2]: %timeit master()
10 loops, best of 3: 148 ms per loop
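
One caveat the answer does not mention: substituting the empty string deletes the separators outright, so adjacent words can run together ('foo!bar' becomes 'foobar'). A minimal sketch, assuming word boundaries should be preserved, substitutes a space instead and collapses the leftover runs:

import re

master_regex = re.compile('|'.join(['[^0-9a-zA-Z]+', r'\s+', 'shit|poo']))

title = 'foo!bar shit baz'
s = master_regex.sub(' ', title)  # 'foo bar   baz' - separators kept as spaces
s = ' '.join(s.split())           # collapse the runs of spaces left behind
print(s)                          # foo bar baz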

You can gain some more speed by avoiding the explicit iteration in Python; the timed list_comp() below is master() with its loop replaced by a list comprehension:

    result = [master_regex.sub('', item) for item in titles]


In [4]: %timeit list_comp()
10 loops, best of 3: 139 ms per loop

Note: the data generation step is included in every timing above, so subtract its cost (measured below) when comparing the regex work itself:

def baseline():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]

In [2]: %timeit baseline()
10 loops, best of 3: 24.8 ms per loop

One way to do this would be to dynamically create a regex from your excluded words and replace them in the list.

Something like:

excludes_regex = re.compile('|'.join(excludes))
titles_excl = []
for title in titles:
    titles_excl.append(excludes_regex.sub('', title))
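
Note that joining the words directly means any regex metacharacter in an exclude word would break the pattern, and short words can match inside longer ones ('ass' inside 'class'). A hedged variant, assuming whole-word removal is wanted, escapes each word and adds word boundaries:

import re

excludes = ["shit", "poo", "ass", "love", "boo", "ch"]  # sample list from above

# assumption: excluded words should only match as whole words, and may
# contain regex metacharacters, so escape each one before joining
excludes_regex = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in excludes) + r')\b')

titles = ['I love my cat', 'ch is not a word']  # stand-in titles
titles_excl = [excludes_regex.sub('', title) for title in titles]
print(titles_excl)  # ['I  my cat', ' is not a word']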
