Fastest check if line starts with value in list?
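I have thousands of values (as a list, but it could be converted to a dictionary if that helps) and want to compare them against a file with millions of lines. What I want to do is filter the lines in the file down to only those that start with a value from the list.
What is the fastest way to do this?
My slow code: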
    for line in source_file:
        # Go through all IDs
        for id in my_ids:
            if line.startswith(str(id) + "|"):
                # replace commas with semicolons and pipes with commas
                target_file.write(line.replace(",", ";").replace("|", ","))
If you are sure that the line starts with id + "|", and that "|" never appears inside an id, I think you can play a trick with "|". For example:
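    # Build a set once: membership tests are fast, and it can be reused
    # for every line (a bare map() is a one-shot iterator in Python 3)
    my_id_strs = set(map(str, my_ids))
    for line in source_file:
        # Only the text before the first "|" is needed
        first_part = line.split("|", 1)[0]
        if first_part in my_id_strs:
            target_file.write(line.replace(",", ";").replace("|", ","))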
Hope this helps :)
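For reference, the idea above can be sketched as a small self-contained function. This is only a sketch, assuming ids never contain "|"; the name filter_by_id and the sample data are made up for illustration. Membership tests against a set take roughly constant time on average, so each line costs one lookup instead of a scan over thousands of ids.

```python
import io

def filter_by_id(ids, lines):
    """Yield converted lines whose first "|"-separated field is in ids."""
    id_strs = {str(i) for i in ids}  # set: average O(1) membership test
    for line in lines:
        # partition splits only at the first "|"; sep is "" if none is found
        first, sep, _rest = line.partition("|")
        if sep and first in id_strs:
            # commas become semicolons, then pipes become commas
            yield line.replace(",", ";").replace("|", ",")

# Hypothetical sample data standing in for the real source file
source_file = io.StringIO("12|a,b|c\n123|x|y\n7|no pipe ids\n")
result = list(filter_by_id([12, 7], source_file))
print(result)
```

The line starting with 123| is skipped even though 12 is in the list, because the whole first field is compared rather than a bare prefix.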
Use the string translate method to do the replacement. You can also break out of the inner loop as soon as an id has matched.
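    # Python 3 spelling; in Python 2 this was string.maketrans
    # replace commas with semicolons and pipes with commas
    trantab = str.maketrans(",|", ";,")
    # Build each "id|" prefix once instead of once per line
    prefixes = ['%d|' % i for i in my_ids]
    for line in source_file:
        # Go through all IDs
        for prefix in prefixes:
            if line.startswith(prefix):
                target_file.write(line.translate(trantab))
                # No other id can match this line, so stop searching
                break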
Or:
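    # Python 3 spelling; in Python 2 this was string.maketrans
    # replace commas with semicolons and pipes with commas
    trantab = str.maketrans(",|", ";,")
    # Store the ids as strings so they compare equal to slices of the line
    idset = set(str(i) for i in my_ids)
    for line in source_file:
        try:
            if line[:line.index('|')] in idset:
                target_file.write(line.translate(trantab))
        except ValueError:
            # The line contains no "|" at all, so it cannot match
            pass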
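To see what the translation table does, here is a minimal sketch using the Python 3 spelling str.maketrans (the sample line is made up). translate maps every character in a single pass, so there is no need to worry about the order of two chained replace calls.

```python
# Python 3 spelling; Python 2 used string.maketrans instead
table = str.maketrans(",|", ";,")

sample = "12|a,b|c"  # hypothetical input line
converted = sample.translate(table)

# The chained form from the question, for comparison
chained = sample.replace(",", ";").replace("|", ",")

print(converted)
print(converted == chained)
```

Both forms give the same result here; the chained version only works because the commas are replaced before the pipes are turned into commas.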
Use a regular expression. Here is an implementation:
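    import re

    def filterlines(prefixes, lines):
        # One alternation of all prefixes, each escaped to match literally
        pattern = "|".join([re.escape(p) for p in prefixes])
        regex = re.compile(pattern)
        for line in lines:
            # match() only tests at the start of the line
            if regex.match(line):
                yield line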
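We first build and compile a regular expression (expensive, but done only once); after that, matching is very, very fast.
Test code for the above: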
    with open("/usr/share/dict/words") as words:
        prefixes = [line.strip() for line in words]
    lines = [
        "zoo this should match",
        "000 this shouldn't match",
    ]
    print(list(filterlines(prefixes, lines)))
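Note that the pattern above accepts any line that merely begins with one of the prefixes, so a prefix of 12 would also accept a line starting with 123|. If the ids are always followed by "|", as in the question, one possible variant is to group the alternatives and require the delimiter after them. This is a sketch under that assumption; the name filterlines_delimited and the sample data are hypothetical.

```python
import re

def filterlines_delimited(prefixes, lines, delimiter="|"):
    """Yield lines that start with one of prefixes followed by delimiter."""
    alternatives = "|".join(re.escape(str(p)) for p in prefixes)
    # (?:...) groups the alternatives so the delimiter applies to all of them
    regex = re.compile("(?:" + alternatives + ")" + re.escape(delimiter))
    for line in lines:
        if regex.match(line):
            yield line

# Hypothetical sample data
lines = ["12|a,b", "123|c,d", "9|e"]
matched = list(filterlines_delimited([12, 9], lines))
print(matched)
```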