
Python - reading huge file

I have the following code, which tries to process a huge file containing multiple XML elements.

from shutil import copyfile
files_with_companies_mentions=[]
# code that reads the file line by line
def read_the_file(file_to_read):
    list_of_files_to_keep=[]
    f = open('huge_file.nml','r')
    lines=f.readlines()
    print("2. I GET HERE ")
    len_lines = len(lines)
    for i in range(0,len(lines)):
        j=i
        if '<?xml version="1.0"' in lines[i]:
            next_line = lines[i+1]
            write_f = open('temp_files/myfile_'+str(i)+'.nml', 'w')
            write_f.write(lines[i])
            while '</doc>' not in next_line:
                write_f.write(next_line)
                j=j+1
                next_line = lines[j]
            write_f.write(next_line)    
            write_f.close()
            list_of_files_to_keep.append(write_f.name)
    return list_of_files_to_keep

The file is over 700 MB in size, with over 20 million lines. Is there a better way to handle it?

As you can see, I need to reference the previous and the next lines with an index variable such as i.

The problem I am facing is that it is very slow: it takes more than an hour per file, and I have several such files to process.

You can use parallel processing to speed this up, using the joblib package. Assuming you have a list of files called files, the structure would be as follows:

import ...
from joblib import Parallel, delayed

def read_the_file(file):
    ...

if __name__ == '__main__':

    n = 8 # number of processors
    Parallel(n_jobs=n)(delayed(read_the_file)(file) for file in files)
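
For example, here is a small self-contained sketch of that structure; the glob pattern, the worker count, and the body of read_the_file are placeholders for illustration, not part of the original answer:

import glob
from joblib import Parallel, delayed

def read_the_file(path):
    # Illustrative worker: just counts the XML declarations in one file;
    # in practice this would hold the splitting logic from the question.
    count = 0
    with open(path, 'r') as f:
        for line in f:
            if '<?xml version="1.0"' in line:
                count += 1
    return count

if __name__ == '__main__':
    files = glob.glob('*.nml')   # assumed input pattern
    n = 8                        # number of worker processes; n_jobs=-1 would use all cores
    results = Parallel(n_jobs=n)(delayed(read_the_file)(f) for f in files)
    # 'results' holds one return value per input file, in input order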

First of all, you shouldn't count the total number of lines yourself or read the whole file at once if you don't need to. Iterate over the file object directly, as in the sketch below, and you'll already save some time. Also see this post on why readlines() is usually a bad idea: http://stupidpythonideas.blogspot.de/2013/06/readlines-considered-silly.html .
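
A minimal sketch of the same splitting logic without readlines(), iterating over the file object directly; it assumes the same '<?xml version="1.0"' and '</doc>' markers as the question, and the output file names are illustrative:

def split_documents(path):
    list_of_files_to_keep = []
    write_f = None
    with open(path, 'r') as f:
        for line in f:                        # streams the file line by line
            if '<?xml version="1.0"' in line:
                # start a new output file for each document
                write_f = open('temp_files/myfile_' + str(len(list_of_files_to_keep)) + '.nml', 'w')
            if write_f is not None:
                write_f.write(line)
                if '</doc>' in line:          # document finished, close it
                    write_f.close()
                    list_of_files_to_keep.append(write_f.name)
                    write_f = None
    return list_of_files_to_keep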

Since you're working with XML, consider using a library that makes this easier, especially for the writing.

  1. suggestion: make use of a context manager:

     with open(filename, 'r') as file: ... 
  2. suggestion: do the reading and processing chunk-wise (currently you read the whole file in a single step and only afterwards go over the resulting list "line by line"); see the fuller sketch after this list:

     for chunk in iter(lambda: file.read(number_of_bytes_to_read), ''): my_function(chunk) 

Of course, this way you have to take care of correct XML tag starts/ends yourself, e.g. by carrying any partial document at the end of one chunk over into the next.
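
A minimal sketch of that chunk-wise approach, assuming (as in the question) that each document ends with '</doc>'; the chunk size and my_function are illustrative placeholders:

def process_in_chunks(path, number_of_bytes_to_read=1024 * 1024):
    buffer = ''
    with open(path, 'r') as f:
        for chunk in iter(lambda: f.read(number_of_bytes_to_read), ''):
            buffer += chunk
            # hand over only complete documents; keep the trailing partial one
            while '</doc>' in buffer:
                doc, buffer = buffer.split('</doc>', 1)
                my_function(doc + '</doc>')

def my_function(doc_text):
    # illustrative placeholder: do whatever per-document processing you need here
    pass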

Alternative: look for an XML parser package; I am quite certain there is one that can process files chunk-wise, with correct tag handling included.
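
For example, the standard library's xml.etree.ElementTree.iterparse reads a file incrementally. A minimal sketch, assuming a well-formed input (e.g. one of the already-split temp files, since the concatenated huge file with multiple XML declarations is not a single valid document) and that the records of interest are <doc> elements:

import xml.etree.ElementTree as ET

def iter_docs(path):
    # stream <doc> elements without loading the whole file into memory
    for event, elem in ET.iterparse(path, events=('end',)):
        if elem.tag == 'doc':
            yield elem
            elem.clear()   # free memory for the already-processed element

# usage:
# for doc in iter_docs('temp_files/myfile_0.nml'):
#     ...  # process one document at a time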
