Fastest way to check whether a line starts with a value from a list?

Question

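I have thousands of values (in a list, but they could be converted to a dictionary if that helps) and want to compare them against a file with millions of lines. What I want to do is filter the file down to only the lines that start with one of the values in the list.

What is the fastest way to do this?

My slow code: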

  for line in source_file:
    # Go through all IDs
    for id in my_ids:
      if line.startswith(str(id) + "|"):
        # replace commas with semicolons and pipes with commas
        target_file.write(line.replace(",",";").replace("|",","))

Answer 1

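If you are sure that each matching line starts with id + "|" and that "|" never appears inside an id, you can take advantage of the "|" delimiter. For example: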

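# Build the set once: membership tests on a set are O(1) on average.
my_id_strs = set(map(str, my_ids))
for line in source_file:
    # Everything before the first "|" is the candidate id.
    first_part = line.split("|", 1)[0]
    if first_part in my_id_strs:
        target_file.write(line.replace(",",";").replace("|",","))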

Hope this helps :)
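The prefix-lookup idea above can be tried end to end on in-memory data. This is only a minimal sketch, not the answerer's exact code: `filter_by_id` is a hypothetical helper name, the sample lines are made up, and it assumes every wanted line has the form `id|rest` with no "|" inside the id.

```python
import io

def filter_by_id(lines, ids):
    """Yield converted lines whose first '|'-separated field is one of ids."""
    # Build the lookup set once; set membership is O(1) on average.
    wanted = {str(i) for i in ids}
    for line in lines:
        # partition() splits at the first '|' only; sep is '' if there is none.
        prefix, sep, _rest = line.partition("|")
        if sep and prefix in wanted:
            # Same conversion as in the question: ',' -> ';' then '|' -> ','.
            yield line.replace(",", ";").replace("|", ",")

# Made-up sample data standing in for the real source file.
source_file = io.StringIO("12|a,b|c\n123|x|y\n7|p,q\nno pipe here\n")
result = list(filter_by_id(source_file, [12, 7]))
print(result)
```

Each line costs one split and one hash lookup, so the work per line no longer grows with the number of ids.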

Answer 2

Use translate for the replacement. You can also break out of the inner loop as soon as an id matches.

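# Translation table: "," -> ";" and "|" -> ","
# (Python 3 spelling; Python 2 used string.maketrans)
trantab = str.maketrans(",|", ";,")

# Precompute the prefixes once instead of rebuilding them for every line
ids = ['%d|' % id for id in my_ids]

for line in source_file:
    # Go through all IDs
    for id in ids:
        if line.startswith(id):
            # Replace commas with semicolons and pipes with commas
            target_file.write(line.translate(trantab))
            break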

Or:

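# Replace commas with semicolons and pipes with commas
# (Python 3 spelling; Python 2 used string.maketrans)
trantab = str.maketrans(",|", ";,")
# Store the ids as strings so they compare equal to the text sliced from each line
idset = set(map(str, my_ids))

for line in source_file:
    try:
        if line[:line.index('|')] in idset:
            target_file.write(line.translate(trantab))
    except ValueError:
        # No "|" in this line, so it cannot start with an id; skip it
        pass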
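To make the translation table concrete, here is a small sketch of what it does to a single line. The sample string is made up for illustration, and the Python 3 `str.maketrans` spelling is assumed.

```python
# str.maketrans maps each character of the first string to the character
# at the same position in the second: ',' -> ';' and '|' -> ','.
table = str.maketrans(",|", ";,")

sample = "42|Smith, John|NY"
converted = sample.translate(table)
print(converted)

# The chained replace() calls from the question give the same result,
# but only because ',' is replaced before '|' is turned into ','.
chained = sample.replace(",", ";").replace("|", ",")
print(converted == chained)
```

translate handles both substitutions in a single pass over the line, so the order of replacements is not a concern.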

Answer 3

Use a regular expression. Here is one implementation:

import re

def filterlines(prefixes, lines):
    pattern = "|".join([re.escape(p) for p in prefixes])
    regex = re.compile(pattern)
    for line in lines:
        if regex.match(line):
            yield line

We first build and compile a regular expression (expensive, but done only once); after that, matching is very fast.

Test code for the above:

with open("/usr/share/dict/words") as words:
    prefixes = [line.strip() for line in words]

lines = [
    "zoo this should match",
    "000 this shouldn't match",
]

print(list(filterlines(prefixes, lines)))
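For the ids in the question, a bare prefix match is not quite enough: the id 12 would also match a line starting with "123|". One possible adaptation, sketched here as an assumption rather than as part of the answer above, appends the "|" delimiter to the pattern. `build_id_regex` is a hypothetical helper name and the sample lines are made up.

```python
import re

def build_id_regex(ids):
    """Compile one pattern matching any of ids followed by a literal '|'."""
    alternatives = "|".join(re.escape(str(i)) for i in ids)
    # The non-capturing group keeps the alternation separate from the
    # escaped '|' delimiter that must follow the id.
    return re.compile(r"(?:" + alternatives + r")\|")

regex = build_id_regex([12, 7])
lines = ["12|a,b", "123|x", "7|p", "x7|q"]
# regex.match() only matches at the start of each line.
matched = [line for line in lines if regex.match(line)]
print(matched)
```

Whether this beats a set lookup on the first field depends on the data, so it is worth timing both on a sample of the real file.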

Answer 1
3, accepted, 2015-11-10 05:35:58

Answer 2
1, 2015-11-10 06:17:22

Answer 3
0, 2015-11-10 05:58:49
