
Memory issues with splitting lines in huge files in Python

I'm trying to read a huge file (~2 GB) from disk and split each line into multiple strings:

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip().split() for line in f]
    return split_lines

The problem is that it tries to allocate tens of GB of memory. I found that this does not happen if I change my code as follows:

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip() for line in f]    # no splitting
    return split_lines

I.e., if I do not split the lines, memory usage drops drastically. Is there any way to handle this problem, maybe some smart way to store the split lines without filling up main memory?

Thank you for your time.

After the split, you have multiple objects per line: a list plus some number of string objects. Each object has its own overhead in addition to the actual characters that make up the original string.
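
As a rough illustration of that per-object overhead, the sketch below compares the size of one line stored as a single string with the size of the same line stored as a list of token strings, using `sys.getsizeof`. The helper names and the sample line are hypothetical, and the exact byte counts depend on the Python version and platform, so treat the output as approximate.

```python
import sys

def size_of_unsplit_line(line):
    """Size in bytes of the single string object holding the line."""
    return sys.getsizeof(line)

def size_of_split_line(line):
    """Size in bytes of the list object plus every token string it refers to."""
    tokens = line.split()
    return sys.getsizeof(tokens) + sum(sys.getsizeof(tok) for tok in tokens)

# Hypothetical sample line; real data will differ.
sample_line = "12.5 0.003 foo bar 42 baz 7.25 qux"

unsplit_bytes = size_of_unsplit_line(sample_line)
split_bytes = size_of_split_line(sample_line)
print("unsplit:", unsplit_bytes, "bytes")
print("split:  ", split_bytes, "bytes")
```

The more short tokens a line contains, the larger the gap, because every token pays the fixed cost of a string object plus a pointer slot in the list.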

Rather than reading the entire file into memory, use a generator.

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        for line in f:
            yield line.rstrip().split()

for t in get_split_lines(file_path):
    # Do something with the list t
    pass

This does not preclude you from writing something like

lines = list(get_split_lines(file_path))

if you really need to read the entire file into memory.
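
For example, an aggregate such as a total token count can be computed with the generator while holding only one split line in memory at a time. This is a minimal sketch: `count_tokens` and the temporary demo file are assumptions made for illustration, not part of the original answer.

```python
import os
import tempfile

def get_split_lines(file_path):
    """Yield each line of the file as a list of whitespace-separated tokens."""
    with open(file_path, 'r') as f:
        for line in f:
            yield line.rstrip().split()

def count_tokens(file_path):
    """Sum the token counts lazily; no list of all lines is ever built."""
    return sum(len(tokens) for tokens in get_split_lines(file_path))

# Small temporary file standing in for the huge input (hypothetical data).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("a b c\nd e\n\nf\n")
    demo_path = tmp.name

total_tokens = count_tokens(demo_path)
os.remove(demo_path)
print(total_tokens)
```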

In the end, I stored a list of stripped lines:

with open(file_path, 'r') as f:
    split_lines = [line.rstrip() for line in f]

Then, in each iteration of my algorithm, I simply recomputed the split line on the fly:

for line in split_lines:
    split_line = line.split()
    # do something with the split line

If you can afford to keep all the lines in memory like I did, and you have to go through the whole file more than once, this approach is faster than the one proposed by @chepner, because the file is read from disk only once.
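
The multi-pass point can be sketched as follows: a generator is exhausted after one traversal, so a second pass requires calling it again (and, for a file-backed generator, re-reading the file), whereas a list of stripped lines can simply be iterated again. The in-memory `stripped_lines` list and the `split_lazily` helper are hypothetical stand-ins for the real file contents.

```python
# Hypothetical stand-in for the stripped lines read from the file.
stripped_lines = ["a b c", "d e"]

def split_lazily(lines):
    """Yield the split form of each line, one at a time."""
    for line in lines:
        yield line.split()

# A generator can be traversed only once.
gen = split_lazily(stripped_lines)
first_pass = [len(tokens) for tokens in gen]
second_pass = [len(tokens) for tokens in gen]  # generator is already exhausted

# The stored list can be traversed repeatedly, re-splitting on the fly.
pass_a = [len(line.split()) for line in stripped_lines]
pass_b = [len(line.split()) for line in stripped_lines]

print(first_pass, second_pass, pass_a, pass_b)
```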
