What is the most efficient way to iterate over every line of a file?

Question

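I have a file, dataset.nt, which isn't too large (300 MB). I also have a list containing about 500 elements. For each element of the list, I want to count the number of lines in the file that contain it, and add that key/value pair to a dictionary (the key is the name of the list element, and the value is the number of times that element appears in the file).

This is the first thing I tried to achieve that result: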

mydict = {}

for i in mylist:
    regex = re.compile(r"/Main/"+re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total+1
    mydict[i] = total

It doesn't work (as in, it runs indefinitely), and I figured I should find a way not to read each line 500 times. So I tried this:

mydict = {}

with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

Performance didn't improve; the script still ran indefinitely. So I googled around and tried this:

mydict = {}

file = open("dataset.nt", "rb")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in list:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

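It has been running for the last 30 minutes, so I assume it's no better.

How should I structure this code so that it completes in a reasonable amount of time?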

Answer 1

I'd favor a slight modification of your second version:

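import re

# Start every count at zero so elements that never match still get an entry
mydict = dict.fromkeys(mylist, 0)

# Compile each pattern once, up front
re_list = [re.compile(r"/Main/" + re.escape(i)) for i in mylist]
# Text mode, so str patterns are matched against str lines
# (UTF-8 is an assumption about the file's encoding)
with open("dataset.nt", "r", encoding="utf-8") as f:
    for line in f:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if "/Main/" not in line:
            continue

        # do the regex-part
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                mydict[i] += 1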

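As @matsjoyce already suggested, this avoids recompiling the regexes on every iteration. If you really need that many different regex patterns, then I don't think there's much more you can do.

Maybe it's worth checking whether you can regex-capture whatever comes after "/Main/" and then compare that against your list. That may help reduce the number of "real" regex searches.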
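To make the capture idea concrete, here is a minimal sketch. It assumes that a name following "/Main/" ends at the first ">", double quote, or whitespace character, which is a guess about the file format and not something the question confirms. The function name count_lines and the TOKEN pattern are made up for illustration. Note that this counts exact names only, whereas the original patterns also match when a list element is merely a prefix of what follows "/Main/".

```python
import re

# Assumption: a name ends at the first '>', '"', or whitespace character.
TOKEN = re.compile(r'/Main/([^>\s"]+)')

def count_lines(lines, names):
    """Count, for each name, the lines containing "/Main/<name>" as a whole token."""
    wanted = set(names)
    counts = dict.fromkeys(names, 0)
    for line in lines:
        # set() so a name that appears twice on one line is counted once
        for name in set(TOKEN.findall(line)):
            if name in wanted:
                counts[name] += 1
    return counts
```

Each line then costs one regex scan plus a few set lookups instead of up to 500 searches. With the real file, something like `count_lines(open("dataset.nt", encoding="utf-8"), mylist)` would be the call, again assuming UTF-8 text.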

Answer 2

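This looks like a good candidate for some map/reduce-style approach (e.g. parallelization). You could split your dataset file into N chunks (where N is the number of processors you have), launch N subprocesses that each scan one chunk, and then sum the results.

Of course, that shouldn't stop you from optimizing the scan first, i.e. (based on sebastian's code):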
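A rough sketch of that map/reduce shape, using the standard multiprocessing module. Instead of splitting the file on disk, it hands batches of lines to a pool of workers, which is a simplification of the chunking described above. The names count_chunk, chunks, and count_parallel, the worker count, and the batch size are all illustrative assumptions, as is the UTF-8 encoding.

```python
from collections import Counter
from functools import partial
from itertools import islice
from multiprocessing import Pool

def count_chunk(lines, names):
    """Map step: for one batch of lines, count the lines containing each name."""
    counts = Counter()
    for line in lines:
        if "/Main/" not in line:
            continue
        for name in names:
            # Plain substring test, equivalent to searching for an escaped regex
            if "/Main/" + name in line:
                counts[name] += 1
    return counts

def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def count_parallel(path, names, workers=4, chunk_size=100_000):
    """Reduce step: sum the per-batch counts produced by the worker processes."""
    totals = Counter(dict.fromkeys(names, 0))
    with open(path, encoding="utf-8") as f, Pool(workers) as pool:
        work = partial(count_chunk, names=names)
        for partial_counts in pool.imap_unordered(work, chunks(f, chunk_size)):
            totals.update(partial_counts)
    return dict(totals)
```

On platforms that spawn worker processes, the call to `count_parallel("dataset.nt", mylist)` should sit under an `if __name__ == "__main__":` guard. Whether this beats a single process depends on how the cost of shipping lines to the workers compares with the cost of scanning them, so it is worth measuring.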

import re

# Pair each element with its precompiled pattern
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]
results = dict.fromkeys(mylist, 0)

# Text mode, so str patterns are matched against str lines
# (UTF-8 is an assumption about the file's encoding)
with open("dataset.nt", "r", encoding="utf-8") as f:
    for line in f:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if "/Main/" not in line:
            continue

        # do the regex-part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1

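Note that better optimizations would be possible if you posted a sample from your dataset. For example, if your dataset can be sorted on "/Main/{i}" (using the system sort program, for instance), you wouldn't have to check every line against every value of i. Or, if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string (which can be faster than a regexp).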
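As an illustration of the fixed-position case, the sketch below replaces the regex with str.startswith at a known offset. The offset parameter and the function name count_fixed_offset are hypothetical; whether "/Main/" really sits at a fixed column depends on the data, which the question does not show.

```python
def count_fixed_offset(lines, names, offset):
    """Count lines where "/Main/<name>" starts exactly at column `offset`."""
    marker = "/Main/"
    start = offset + len(marker)  # column where the name itself begins
    counts = dict.fromkeys(names, 0)
    for line in lines:
        # Cheap rejection: the marker must be at the expected position
        if not line.startswith(marker, offset):
            continue
        for name in names:
            # Prefix comparison at a fixed position, no regex involved
            if line.startswith(name, start):
                counts[name] += 1
    return counts
```

Like the original patterns, this treats a list element as matched when it is a prefix of what follows "/Main/", so "Foo" also counts a line containing "/Main/Foobar".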

Answer 3

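The other solutions are very good. But since there is one regex per element, and it doesn't matter whether the element appears more than once per line, you could count the lines containing the target expression with re.findall.

Also, beyond a certain number of lines, it's better to read the whole file into memory (if you have enough memory and it isn't a design restriction).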

    import re

    mydict = {}
    mylist = [...] # A list with 500 items
    # Optimizing calls
    findall = re.findall  # Local alias, so Python doesn't have to resolve the attribute on every call
    escape = re.escape

    # Text mode, so the str pattern is matched against str data
    # (UTF-8 is an assumption about the file's encoding)
    with open("dataset.nt", "r", encoding="utf-8") as f:
        text = f.read() # Read the file once and keep it in memory instead of reading it line by line. With many lines this is faster.
        for elem in mylist:
            # "." never matches a newline, so each match stays within one line
            # and a line is counted once, however often the element appears in it.
            mydict[elem] = len(findall(".*/Main/{0}.*".format(escape(elem)), text))

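I tested with a file 800 MB in size (I wanted to see how long it takes to load a file that big into memory; it's much faster than you might think).

I didn't test the whole code with real data, only the findall part.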

3 answers

Answer 1
1 accepted 2014-10-29 17:44:37

Answer 2
0 2014-10-29 18:02:04

Answer 3
0 2014-10-29 18:27:23
