
Memory issues with splitting lines in huge files in Python

I'm trying to read a huge file (~2GB) from disk and split each line into multiple strings:

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip().split() for line in f]
    return split_lines

The problem is that it tries to allocate tens of GB of memory. I found out that this doesn't happen if I change my code in the following way:

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip() for line in f]    # no splitting
    return split_lines

That is, if I do not split the lines, memory usage goes down drastically. Is there any way to handle this problem, perhaps some smart way to store the split lines without filling up main memory?

Thank you for your time.

After the split, you have multiple objects per line: a list plus some number of string objects. Each object has its own overhead in addition to the actual characters that make up the original string.
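
A minimal sketch of that overhead, assuming CPython 3.x (the exact byte counts from sys.getsizeof vary by version and platform):

    import sys

    line = "alpha beta gamma delta epsilon"
    parts = line.split()

    # One string object holding all 30 characters.
    print(sys.getsizeof(line))    # roughly 80 bytes on CPython 3.x

    # After splitting: the list itself plus one string object per word,
    # each paying its own object-header overhead for the same characters.
    total = sys.getsizeof(parts) + sum(sys.getsizeof(p) for p in parts)
    print(total)                  # several times larger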

Rather than reading the entire file into memory, use a generator.

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        for line in f:
            yield line.rstrip().split()

for t in get_split_lines(file_path):
    pass  # do something with the list t

This does not preclude you from writing something like

lines = list(get_split_lines(file_path))

if you really need to read the entire file into memory.

In the end, I stored a list of stripped lines:

with open(file_path, 'r') as f:
    split_lines = [line.rstrip() for line in f]

Then, in each iteration of my algorithm, I simply recomputed the split line on the fly:

for line in split_lines:
    split_line = line.split()
    # do something with the split line

If you can afford to keep all the lines in memory, as I did, and you have to go through the file more than once, this approach is faster than the one proposed by @chepner, since you read the file's lines from disk only once.
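
For comparison, a minimal multi-pass sketch using the generator version of get_split_lines from above (num_passes is hypothetical): each pass re-opens and re-reads the whole ~2GB file from disk, whereas with the cached stripped lines only the cheap split() is repeated.

    num_passes = 3  # hypothetical number of passes over the data

    for _ in range(num_passes):
        # each pass re-reads the entire file from disk
        for split_line in get_split_lines(file_path):
            pass  # do something with split_line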
