python中的字符串替換性能

Question

我有一個約50,000個字符串（標題）的列表，以及從這些標題中刪除的約150個字的列表（如果找到它們）。 到目前為止，我的代碼如下。 最終輸出應該是50,000個字符串的列表，其中刪除了150個單詞的所有實例。 我想知道這樣做最有效（表現明智）的方式。 我的代碼似乎在運行，雖然速度很慢..

excludes = GetExcludes()
titles = GetTitles()
titles_alpha = []
titles_excl = []
for k in range(len(titles)):
    #remove all non-alphanumeric characters 
    s = re.sub('[^0-9a-zA-Z]+', ' ',titles[k])

    #remove extra white space
    s = re.sub( '\s+', ' ', s).strip()

    #lowercase
    s = s.lower()

    titles_alpha.append(s)
    #remove any excluded words


    for i in range (len(excludes)):
        titles_excl.append(titles_alpha[k].replace(excludes[i],''))

print titles_excl

Answer 1

正則表達式的許多性能開銷來自編譯正則表達式。 您應該將正則表達式的編譯移出循環。

這應該會給你一個相當大的改進：

pattern1 = re.compile('[^0-9a-zA-Z]+')
pattern2 = re.compile('\s+')
for k in range(len(titles)):
    #remove all non-alphanumeric characters 
    s = re.sub(pattern1,' ',titles[k])

    #remove extra white space
    s = re.sub(pattern2,' ', s).strip()

有一些測試wordlist.txt從這里：

import re
def noncompiled():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = re.sub('[^0-9a-zA-Z]+', ' ',titles[k])

        #remove extra white space
        s = re.sub( '\s+', ' ', s).strip()

def compiled():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1=re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile( '\s+')
    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = pattern1.sub('',titles[k])

        #remove extra white space
        s = pattern2.sub('', s)



In [2]: %timeit noncompiled()
1 loops, best of 3: 292 ms per loop

In [3]: %timeit compiled()
10 loops, best of 3: 176 ms per loop

要從排除列表中刪除“壞詞”，您應該像@zsquare建議創建一個聯合正則表達式，這很可能是您可以獲得的最快速度。

def with_excludes():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1=re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile( '\s+')
    excludes = ["shit","poo","ass","love","boo","ch"]
    excludes_regex = re.compile('|'.join(excludes))
    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = pattern1.sub('',titles[k])

        #remove extra white space
        s = pattern2.sub('', s)
        #remove bad words
        s = pattern2.sub('', s)
In [2]: %timeit with_excludes()
1 loops, best of 3: 251 ms per loop

只需編譯一個主正則表達式，您就可以進一步采用這種方法：

def master():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]
    excludes = ["shit","poo","ass","love","boo","ch"]
    nonalpha='[^0-9a-zA-Z]+'
    whitespace='\s+'
    badwords = '|'.join(excludes)
    master_regex=re.compile('|'.join([nonalpha,whitespace,badwords]))

    for k in range(len(titles)):
        #remove all non-alphanumeric characters 
        s = master_regex.sub('',titles[k])
In [2]: %timeit master()
10 loops, best of 3: 148 ms per loop

通過避免python中的迭代，您可以獲得更快的速度：

    result = [master_regex.sub('',item) for item in titles]


In [4]: %timeit list_comp()
10 loops, best of 3: 139 ms per loop

注意：數據生成步驟：

def baseline():
    with open("wordlist.txt",'r') as f:
        titles = f.readlines()
    titles = ["".join([title,nonalpha]) for title in titles for nonalpha in "!@#$%"]

In [2]: %timeit baseline()
10 loops, best of 3: 24.8 ms per loop

Answer 2

一種方法是動態創建被排除單詞的正則表達式並在列表中替換它們。

就像是：

excludes_regex = re.compile('|'.join(excludes))
titles_excl = []
for title in titles:
    titles_excl.append(excludes_regex.sub('', title))

python中的字符串替換性能

問題描述

2 個解決方案

解決方案1
8 已采納 2015-11-23 13:26:19

解決方案2
2 2015-11-23 13:30:10

python中的字符串替換性能

問題描述

2 個解決方案

解決方案1 8 已采納 2015-11-23 13:26:19

解決方案2 2 2015-11-23 13:30:10

解決方案1
8 已采納 2015-11-23 13:26:19

解決方案2
2 2015-11-23 13:30:10