Python: reading a large file

Question

I have the following code, which tries to process a large file containing multiple XML elements.

from shutil import copyfile
files_with_companies_mentions=[]
# code that reads the file line by line
def read_the_file(file_to_read):
    list_of_files_to_keep=[]
    f = open('huge_file.nml','r')
    lines=f.readlines()
    print("2. I GET HERE ")
    len_lines = len(lines)
    for i in range(0,len(lines)):
        j=i
        if '<?xml version="1.0"' in lines[i]:
            next_line = lines[i+1]
            write_f = open('temp_files/myfile_'+str(i)+'.nml', 'w')
            write_f.write(lines[i])
            while '</doc>' not in next_line:
                write_f.write(next_line)
                j=j+1
                next_line = lines[j]
            write_f.write(next_line)    
            write_f.close()
            list_of_files_to_keep.append(write_f.name)
    return list_of_files_to_keep

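The file is over 700 MB and has more than 20 million lines. Is there a better way to process it?

As you can see, I need an index variable (such as i) to refer to the previous and next lines.

The problem I am facing is that it is very slow. Each file takes more than an hour, and I have several files.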

Answer 1

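You can speed this up with parallel processing, using the joblib package. Assuming you have a list of files named files, the structure would look like this: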

import ...
from joblib import Parallel, delayed

def read_the_file(file):
    ...

if __name__ == '__main__':

    n = 8 # number of processors
    Parallel(n_jobs=n)(delayed(read_the_file)(file) for file in files)

Answer 2

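First of all, you don't need to determine the total number of lines yourself, and you don't need to read the whole file at once. Simply looping over the file, for example, already saves some time. Also, see this post on the use of readlines(): http://stupidpythonideas.blogspot.de/2013/06/readlines-considered-silly.html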
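As a minimal sketch of the loop idea above (not necessarily what the answer had in mind): iterate over the file object directly and keep one output file open while inside a document. The function name split_documents and the out_dir parameter are made up for illustration; the `<?xml version="1.0"` and `</doc>` markers and the myfile_N.nml naming are taken from the question.

```python
import os


def split_documents(source_path, out_dir):
    """Write each '<?xml ...' to '</doc>' span to its own file; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None  # currently open output file, or None between documents
    with open(source_path, "r") as src:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never held in memory.
        for line_no, line in enumerate(src):
            if out is None and '<?xml version="1.0"' in line:
                path = os.path.join(out_dir, "myfile_%d.nml" % line_no)
                out = open(path, "w")
                written.append(path)
            if out is not None:
                out.write(line)
                if "</doc>" in line:
                    out.close()
                    out = None
    if out is not None:  # final document had no closing tag
        out.close()
    return written
```

This makes a single pass and needs no index into a list of lines, because the only state carried between lines is whether a document is currently open.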

Given that you are working with XML elements, consider using a library that makes this easier, especially for the writing.
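For instance, the standard library's xml.etree.ElementTree can parse one complete document and write it back out. This is only a sketch under the assumption that each `<?xml ...` to `</doc>` span is a well-formed document on its own; the helper names and the `company` element used in the usage lines are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_document(xml_bytes):
    """Parse one complete XML document and return its root element."""
    return ET.fromstring(xml_bytes)


def write_document(root, path):
    """Serialize the tree, letting the library emit the XML declaration."""
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Hypothetical sample document; 'company' is not from the question.
    root = parse_document(b'<?xml version="1.0"?><doc><company>Acme</company></doc>')
    print(root.findtext("company"))
```

With a parsed tree you can query elements by name instead of searching raw lines for substrings, and the writer takes care of escaping and closing tags.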

Answer 3

Suggestion: use a context manager:
```
 with open(filename, 'r') as file: ... 
```
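Suggestion: read and process the file in chunks (currently you read the whole file in a single step and then walk through the resulting list line by line):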
```
for chunk in iter(lambda: file.read(number_of_bytes_to_read), ''):  # '' marks end of file in text mode
    my_function(chunk)
```

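Of course, with this approach you have to make sure XML start and end tags are handled correctly, since a chunk boundary can fall in the middle of a tag.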
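One way to deal with tags that straddle chunk boundaries is to keep a buffer and only emit text up to the last complete closing tag. The following is a rough sketch, assuming every document ends with `</doc>` as in the question; iter_documents, chunk_size and end_tag are names chosen here for illustration.

```python
def iter_documents(file_obj, chunk_size=1024 * 1024, end_tag="</doc>"):
    """Yield complete documents, reading chunk_size characters at a time.

    Assumes file_obj was opened in text mode, so read() returns '' at EOF.
    """
    buffer = ""
    # iter(callable, sentinel) keeps calling read() until it returns ''.
    for chunk in iter(lambda: file_obj.read(chunk_size), ""):
        buffer += chunk
        while True:
            end = buffer.find(end_tag)
            if end == -1:
                break  # closing tag not seen yet, or split across chunks
            end += len(end_tag)
            yield buffer[:end].lstrip()
            buffer = buffer[end:]
    if buffer.strip():
        yield buffer.lstrip()  # trailing data without a closing tag
```

A partial `</do` at the end of one chunk simply stays in the buffer and is completed by the next read, so no document is cut in half.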

Alternative: look for an XML parser package. I'm fairly sure there is one that can process a file in batches, including correct tag handling.
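One candidate from the standard library is xml.etree.ElementTree.XMLPullParser, which accepts data piece by piece and reports elements as they are completed. The sketch below assumes the input is a single well-formed document with one root element; a file like the one in the question, with several XML declarations concatenated, would have to be split into documents first. The function name iter_elements and its parameters are illustrative.

```python
import xml.etree.ElementTree as ET


def iter_elements(file_obj, tag, chunk_size=64 * 1024):
    """Feed the parser chunk by chunk and yield each completed <tag> element.

    Assumes file_obj is in text mode and holds ONE well-formed document.
    """
    parser = ET.XMLPullParser(events=("end",))
    for chunk in iter(lambda: file_obj.read(chunk_size), ""):
        parser.feed(chunk)  # the parser copes with tags split across chunks
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem
                # Drop text and children once the caller is done with it,
                # so finished elements do not pile up in memory.
                elem.clear()
    parser.close()
```

Each yielded element should be used before asking for the next one, since it is cleared as soon as the loop resumes.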

3 answers

Answer 1 0 2017-04-19 13:29:13

Answer 2 0 2017-04-19 13:34:19

Answer 3 0 2017-04-19 13:43:05
