String substitution performance in Python
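I have a list of about 50,000 strings (titles) and a list of about 150 words to remove from those titles wherever they occur. My code so far is below. The final output should be a list of 50,000 strings with every instance of the 150 words removed. I would like to know the most efficient (performance-wise) way to do this. My code seems to work, but it is very slow.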
import re

excludes = GetExcludes()
titles = GetTitles()
titles_alpha = []
titles_excl = []
for k in range(len(titles)):
    # remove all non-alphanumeric characters
    s = re.sub(r'[^0-9a-zA-Z]+', ' ', titles[k])
    # remove extra white space
    s = re.sub(r'\s+', ' ', s).strip()
    # lowercase
    s = s.lower()
    titles_alpha.append(s)
    # remove any excluded words
    for i in range(len(excludes)):
        titles_excl.append(titles_alpha[k].replace(excludes[i], ''))
print(titles_excl)
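Much of the performance overhead of regular expressions comes from compiling them. You should move the compilation out of the loop.
This should give you a considerable improvement: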
pattern1 = re.compile(r'[^0-9a-zA-Z]+')
pattern2 = re.compile(r'\s+')
for k in range(len(titles)):
    # remove all non-alphanumeric characters
    s = pattern1.sub(' ', titles[k])
    # remove extra white space
    s = pattern2.sub(' ', s).strip()
Some tests, using a wordlist.txt file as input:
import re

def noncompiled():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    for k in range(len(titles)):
        # remove all non-alphanumeric characters
        s = re.sub(r'[^0-9a-zA-Z]+', ' ', titles[k])
        # remove extra white space
        s = re.sub(r'\s+', ' ', s).strip()

def compiled():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1 = re.compile(r'[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    for k in range(len(titles)):
        # remove all non-alphanumeric characters
        s = pattern1.sub('', titles[k])
        # remove extra white space
        s = pattern2.sub('', s)
In [2]: %timeit noncompiled()
1 loops, best of 3: 292 ms per loop
In [3]: %timeit compiled()
10 loops, best of 3: 176 ms per loop
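Outside IPython, the same comparison can be made with the standard-library timeit module. The following is only a minimal sketch: the sample titles and the function names are hypothetical stand-ins for wordlist.txt and the functions above. Note that the re module caches recently used patterns, so passing a pattern string on every call mostly costs a cache lookup rather than a full recompilation; the measured gap will depend on the data and the Python version.

```python
import re
import timeit

# Hypothetical in-memory sample data standing in for wordlist.txt.
titles = ["Hello,   World!", "Foo---Bar  baz", "Python's re module"] * 1000

def clean_uncompiled(items):
    # Pattern strings are handed to re.sub on every call.
    out = []
    for t in items:
        s = re.sub(r'[^0-9a-zA-Z]+', ' ', t)
        out.append(re.sub(r'\s+', ' ', s).strip())
    return out

# Compile once, outside the loop.
NON_ALNUM = re.compile(r'[^0-9a-zA-Z]+')
WHITESPACE = re.compile(r'\s+')

def clean_compiled(items):
    out = []
    for t in items:
        s = NON_ALNUM.sub(' ', t)
        out.append(WHITESPACE.sub(' ', s).strip())
    return out

if __name__ == "__main__":
    for fn in (clean_uncompiled, clean_compiled):
        seconds = timeit.timeit(lambda: fn(titles), number=20)
        print("%s: %.3f s for 20 runs" % (fn.__name__, seconds))
```

Both functions return the same cleaned list, so only the timing differs.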
To remove the "bad words" on the exclude list, you should build one combined regex, as @zsquare suggests; that is most likely the fastest you can get.
def with_excludes():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1 = re.compile(r'[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
    excludes_regex = re.compile('|'.join(excludes))
    for k in range(len(titles)):
        # remove all non-alphanumeric characters
        s = pattern1.sub('', titles[k])
        # remove extra white space
        s = pattern2.sub('', s)
        # remove bad words
        s = excludes_regex.sub('', s)
In [2]: %timeit with_excludes()
1 loops, best of 3: 251 ms per loop
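One caveat with '|'.join(excludes): the words are inserted into the pattern unescaped, and they match inside longer words as well (for example "ass" inside "class"). If whole-word removal is what is wanted, which is an assumption about the question's intent, a variant along the following lines may be closer. The helper names and sample data are hypothetical.

```python
import re

def build_excludes_regex(excludes):
    # re.escape guards against regex metacharacters in the words;
    # \b on both sides restricts matches to whole words.
    alternatives = '|'.join(re.escape(word) for word in excludes)
    return re.compile(r'\b(?:' + alternatives + r')\b')

def remove_words(title, excludes_regex):
    # Delete the excluded words, then collapse the leftover whitespace.
    return ' '.join(excludes_regex.sub('', title).split())

excludes = ["the", "of", "and"]  # hypothetical sample data
excludes_regex = build_excludes_regex(excludes)
cleaned = remove_words("the theory of cats and dogs", excludes_regex)
```

Here "theory" is left alone even though it starts with "the", because the trailing \b requires a word boundary after the match.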
You can take this approach further by compiling just one master regex:
def master():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
    nonalpha = r'[^0-9a-zA-Z]+'
    whitespace = r'\s+'
    badwords = '|'.join(excludes)
    master_regex = re.compile('|'.join([nonalpha, whitespace, badwords]))
    for k in range(len(titles)):
        # remove non-alphanumeric characters, whitespace and bad words in one pass
        s = master_regex.sub('', titles[k])
In [2]: %timeit master()
10 loops, best of 3: 148 ms per loop
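Because the master regex replaces every match with the empty string, the remaining words end up glued together, whereas the question's code replaces separators with a space. One way to keep a single pass while treating the two kinds of match differently is to give re.sub a replacement function. This is a sketch under assumptions, not the answer's method: the group names, the sample words and the whole-word matching are all illustrative choices.

```python
import re

excludes = ["foo", "bar"]  # hypothetical excluded words

# One pattern with two named groups: excluded whole words, and runs of
# non-alphanumeric characters (a class that already covers whitespace).
master_regex = re.compile(
    r'(?P<bad>\b(?:' + '|'.join(map(re.escape, excludes)) + r')\b)'
    r'|(?P<sep>[^0-9a-zA-Z]+)'
)

def _replace(match):
    # Drop excluded words entirely; turn any separator run into one space.
    return '' if match.lastgroup == 'bad' else ' '

def clean(title):
    s = master_regex.sub(_replace, title.lower())
    # Collapse the doubled spaces left behind where a word was removed.
    return ' '.join(s.split())

result = clean("Foo: the BAR-barn!")
```

The replacement function adds per-match call overhead, so this trades some of the speed gained by the single pass for output that matches the question's format.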
You can gain a little more speed by replacing the explicit for loop with a list comprehension:
result = [master_regex.sub('',item) for item in titles]
In [4]: %timeit list_comp()
10 loops, best of 3: 139 ms per loop
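The body of list_comp() is not shown with the timing above. A plausible reconstruction, assuming it mirrors master() with the loop swapped for the comprehension, could look like this. The signature taking titles and excludes as arguments is hypothetical and is used so that the sketch runs without wordlist.txt; re.escape is an added precaution.

```python
import re

def list_comp(titles, excludes):
    # Same master pattern as in master(): non-alphanumerics, whitespace,
    # and the excluded words (escaped in case they contain metacharacters).
    parts = [r'[^0-9a-zA-Z]+', r'\s+'] + [re.escape(word) for word in excludes]
    master_regex = re.compile('|'.join(parts))
    # The comprehension avoids the per-iteration list.append lookup and call.
    return [master_regex.sub('', item) for item in titles]

sample = list_comp(["ab cd!", "choose love"], ["love", "ch"])
```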
Note: for reference, the data-generation step on its own:
def baseline():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
In [2]: %timeit baseline()
10 loops, best of 3: 24.8 ms per loop
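One way to do this is to dynamically build a regex from the excluded words and use it to remove them from each title in the list.
Something like: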
excludes_regex = re.compile('|'.join(excludes))
titles_excl = []
for title in titles:
    titles_excl.append(excludes_regex.sub('', title))