
Fastest check if line starts with value in list?

I have thousands of values (as a list, but I could convert them to a dictionary or similar if that helps) and want to compare them against files with millions of lines. What I want to do is filter the lines in each file down to only those that start with one of the values in the list.

What is the fastest way to do it?

My slow code:

for line in source_file:
    # Go through all IDs
    for id in my_ids:
        if line.startswith(str(id) + "|"):
            # replace commas with semicolons and pipes with commas
            target_file.write(line.replace(",",";").replace("|",","))

If you are sure that each line starts with id + "|" and that "|" never appears inside an id, I think you could exploit the "|" delimiter. For example:

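# Build a set once: membership tests are then hash lookups, and unlike a
# bare map() object in Python 3 it is not exhausted by the first "in" test.
my_id_strs = set(map(str, my_ids))
for line in source_file:
    first_part = line.split("|")[0]
    if first_part in my_id_strs:
        target_file.write(line.replace(",",";").replace("|",","))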

Hope this will help :)
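
A self-contained sketch of the same idea, assuming the IDs never contain "|". The function name `filter_by_id` and the sample lines are illustrative and not from the original post. It uses `str.partition` so that only the first field is split off, and a set so that each membership test is a hash lookup rather than a scan of the whole list.

```python
def filter_by_id(lines, ids, sep="|"):
    """Yield lines whose first sep-delimited field is one of ids."""
    wanted = {str(i) for i in ids}  # set: average O(1) membership test
    for line in lines:
        # partition splits at the first separator only;
        # found is "" when the line contains no separator at all
        first, found, _rest = line.partition(sep)
        if found and first in wanted:
            yield line


# Hypothetical sample data standing in for source_file
sample = ["12|a,b\n", "123|c\n", "7|d\n", "no separator\n"]
print(list(filter_by_id(sample, [12, 7])))
```

Because the function is a generator, it can be fed an open file object directly and never holds more than one line in memory.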

Use the string's translate method to do the replacement. You can also break out of the inner loop as soon as an id matches.

# Python 3: maketrans is a static method of str
# (Python 2 used "from string import maketrans")
trantab = str.maketrans(",|", ";,")

ids = ['%d|' % id for id in my_ids]

for line in source_file:
    # Go through all IDs
    for id in ids:
        if line.startswith(id):
            # replace commas with semicolons and pipes with commas
            target_file.write(line.translate(trantab))
            break

or

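# Python 3: maketrans is a static method of str
# replace commas with semicolons and pipes with commas
trantab = str.maketrans(",|", ";,")
# Store the ids as strings so they compare equal to the sliced line prefix
idset = set(map(str, my_ids))

for line in source_file:
    try:
        if line[:line.index('|')] in idset:
            target_file.write(line.translate(trantab))
    except ValueError:
        # the line contains no "|" at all, so it cannot match; skip it
        continue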
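
As a variation on the first version above, `str.startswith` also accepts a tuple of prefixes, which moves the loop over the ids out of Python code and into the built-in method. The sketch below assumes integer ids and the `id|` line format from the question; `filter_and_translate` and the sample data are hypothetical names used only for illustration. It still tests the prefixes one after another, so with thousands of ids the set lookup in the second version is likely the better fit.

```python
def filter_and_translate(lines, ids):
    """Yield matching lines with ',' -> ';' and '|' -> ',' applied."""
    # startswith() requires a tuple here, not a list
    prefixes = tuple("%d|" % i for i in ids)
    table = str.maketrans(",|", ";,")
    for line in lines:
        if line.startswith(prefixes):
            yield line.translate(table)


# Hypothetical sample data standing in for source_file
sample = ["12|a,b\n", "99|x\n", "123|c\n"]
for out in filter_and_translate(sample, [12, 123]):
    print(out, end="")
```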

Use a regular expression. Here is an implementation:

import re

def filterlines(prefixes, lines):
    pattern = "|".join([re.escape(p) for p in prefixes])
    regex = re.compile(pattern)
    for line in lines:
        if regex.match(line):
            yield line

We build and compile the regular expression first (expensive, but done only once); after that, matching each line is very, very fast.

Test code for the above:

with open("/usr/share/dict/words") as words:
    prefixes = [line.strip() for line in words]

lines = [
    "zoo this should match",
    "000 this shouldn't match",
]

print(list(filterlines(prefixes, lines)))
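
One caveat when applying this to the data in the question: a plain prefix pattern built from the id 12 would also accept a line that starts with 123|. A minimal sketch of one way to handle that, assuming the `id|` format from the question, is to append the escaped separator to the grouped alternation. The helper name `compile_id_regex` is hypothetical.

```python
import re


def compile_id_regex(ids, sep="|"):
    """Compile a pattern matching any of ids followed directly by sep."""
    # Escape every id, then group the alternation so sep applies to all of it
    alternatives = "|".join(re.escape(str(i)) for i in ids)
    return re.compile("(?:" + alternatives + ")" + re.escape(sep))


regex = compile_id_regex([12, 7])
for line in ["12|a,b", "123|c", "7|d"]:
    # re.match anchors at the start of the string, so this is a prefix test
    print(line, bool(regex.match(line)))
```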
