What is the most efficient way to loop through each line of a file?

Question

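I have a file, dataset.nt, which isn't too large (300 MB). I also have a list containing about 500 elements. For each element of the list, I want to count the number of lines in the file that contain it, and add that key/value pair to a dictionary (the key is the name of the list element, and the value is the number of times that element appears in the file).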

This is the first thing I tried to achieve that result:

mydict = {}

for i in mylist:
    regex = re.compile(r"/Main/"+re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total+1
    mydict[i] = total

It doesn't work (as in, it runs indefinitely), and I figured I should find a way to avoid reading each line 500 times. So I tried this:

mydict = {}

with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

Performance didn't improve; the script still ran indefinitely. So I googled around and tried this:

mydict = {}

file = open("dataset.nt", "rb")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

That one has now been running for 30 minutes, so I assume it's no better.

How should I structure this code so that it completes in a reasonable amount of time?

Answer 1

I'd favor a slight modification of your second version:

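import re

# Start every count at zero so elements that never match still get an entry.
mydict = dict.fromkeys(mylist, 0)

# Compile each pattern once, up front, instead of once per line.
re_list = [re.compile(r"/Main/" + re.escape(i)) for i in mylist]
# Text mode (assuming the file is UTF-8) so the str patterns apply to each line.
with open("dataset.nt", "r", encoding="utf-8") as infile:
    for line in infile:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if "/Main/" not in line:
            continue

        # do the regex-part
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                # accumulate across lines instead of resetting per line
                mydict[i] += 1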

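As @matsjoyce already suggested, this avoids recompiling the regexes on every iteration. If you really need that many different regex patterns, then I don't think there's much more you can do.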

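Maybe it's worth checking whether you can regex-capture whatever comes after "/Main/" and then compare that with your list. That may help reduce the number of "real" regex searches.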
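The capture idea could look roughly like the sketch below. It runs one regex per line and then does set lookups, instead of up to 500 searches. The function name `count_main_tokens` and the character class that decides where a name ends are assumptions made for illustration; the real delimiter depends on what the file looks like. Note also that this counts whole-name matches only, whereas the per-element patterns above also match longer names that merely start with a list element.

```python
import re


def count_main_tokens(lines, names):
    """Count, for each name, the lines that contain /Main/<name>."""
    wanted = set(names)
    # Assumption: a name ends at the first whitespace, '>', '/' or '"'.
    token_re = re.compile(r'/Main/([^\s>/"]+)')
    counts = dict.fromkeys(names, 0)
    for line in lines:
        # set() makes sure a name is counted once per line,
        # even if it occurs several times in that line.
        for token in set(token_re.findall(line)):
            if token in wanted:
                counts[token] += 1
    return counts
```

Since it only needs an iterable of lines, it could be called as `count_main_tokens(open("dataset.nt", encoding="utf-8"), mylist)`.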

Answer 2

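This looks like a good candidate for a map/reduce approach (i.e. parallelization). You could split the dataset file into N chunks (where N is the number of processors you have), launch N subprocesses that each scan one chunk, and then sum the results.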
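As a rough sketch of that split/scan/sum idea, assuming the standard `multiprocessing` module and a UTF-8 file: the helper names `chunk_offsets`, `count_chunk` and `parallel_count` are made up for this example, and the chunks are byte ranges nudged forward so they end on line boundaries.

```python
import os
import re
from collections import Counter
from multiprocessing import Pool


def chunk_offsets(path, n_chunks):
    """Split the file into up to n_chunks byte ranges ending on line boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for k in range(1, n_chunks):
            f.seek(size * k // n_chunks)
            f.readline()  # move forward to the start of the next line
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def count_chunk(args):
    """Map step: count matching lines inside one byte range."""
    path, start, end, names = args
    patterns = [(n, re.compile(b"/Main/" + re.escape(n.encode("utf-8"))))
                for n in names]
    counts = Counter()
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if b"/Main/" not in line:
                continue
            for name, regex in patterns:
                if regex.search(line):
                    counts[name] += 1
    return counts


def parallel_count(path, names, n_procs=None):
    """Reduce step: sum the per-chunk counters."""
    n_procs = n_procs or os.cpu_count() or 1
    tasks = [(path, s, e, names) for s, e in chunk_offsets(path, n_procs)]
    total = Counter({n: 0 for n in names})
    with Pool(n_procs) as pool:
        for partial in pool.imap_unordered(count_chunk, tasks):
            total.update(partial)  # Counter.update adds the counts
    return dict(total)
```

On platforms that start workers by spawning a fresh interpreter, the call to `parallel_count("dataset.nt", mylist)` should sit under an `if __name__ == "__main__":` guard. Whether this actually beats a single process depends on how much of the time is regex work rather than disk reads.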

Of course, that shouldn't stop you from optimizing the scan first, i.e. (based on sebastian's code):

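import re

# Compile each pattern once and pair it with its list element.
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]
# Start every count at zero so non-matching elements still appear.
results = dict.fromkeys(mylist, 0)

# Text mode (assuming the file is UTF-8) so the str patterns apply to each line.
with open("dataset.nt", "r", encoding="utf-8") as infile:
    for line in infile:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if "/Main/" not in line:
            continue

        # do the regex-part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1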

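Note that we could help you optimize further if you posted a sample from your dataset. For example, if your dataset can be sorted on "/Main/{i}" (e.g. using the system sort program), you wouldn't have to check every line for every value of i. Or, if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string (which can be faster than a regexp).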
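To illustrate the fixed-position case, here is a minimal sketch that replaces the regex with `str.startswith` at a known offset. The function name `count_fixed_position` and the default prefix `<http://example.org/Main/` are hypothetical; the real prefix would have to come from the actual data.

```python
def count_fixed_position(lines, names, prefix="<http://example.org/Main/"):
    """Count lines where a name starts right after a fixed, known prefix."""
    start = len(prefix)
    counts = dict.fromkeys(names, 0)
    for line in lines:
        # Cheap rejection of lines that don't begin with the expected prefix.
        if not line.startswith(prefix):
            continue
        for name in names:
            # startswith with an offset compares in place: no slicing, no regex.
            if line.startswith(name, start):
                counts[name] += 1
    return counts
```

Like the regex versions, this is a prefix match, so a list element `Foo` is also counted on a line that contains `/Main/Foobar`.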

Answer 3

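The other solutions are very good. But since there is a regex for each element, and it doesn't matter whether the element appears more than once per line, you could count the lines containing the target expression with re.findall.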

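Also, beyond a certain number of lines it is better to read the whole file into memory (if you have enough memory and it isn't a design restriction).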

    import re

    mydict = {}
    mylist = [...]  # A list with 500 items
    # Bind the functions to local names so Python doesn't have to
    # resolve the attribute lookup on every call.
    findall = re.findall
    escape = re.escape

    # Text mode (assuming the file is UTF-8) so the str patterns below apply.
    with open("dataset.nt", "r", encoding="utf-8") as infile:
        # Read the file once and keep it in memory instead of going back
        # to disk for each line; with many lines this is faster.
        text = infile.read()
    for elem in mylist:
        # "." never matches a newline, so each match stays within one line
        # and a line is counted once however often elem occurs in it.
        mydict[elem] = len(findall(".*/Main/{0}.*".format(escape(elem)), text))

I tested this with an 800 MB file (I wanted to see how long it takes to load a file that large into memory; it is much faster than you would think).

I didn't test the whole code with real data, only the findall part.

3 answers

Answer 1
1 (accepted) 2014-10-29 17:44:37

Answer 2
0 2014-10-29 18:02:04

Answer 3
0 2014-10-29 18:27:23
