
Memory use while doing line-by-line reading of a large file into Python 2.7

I am working on a genomics project involving some large files (10-50Gb) which I want to read into Python 2.7 for processing. I do not need to read an entire file into memory; rather, I simply want to read each file line by line, do a small task, and continue.

I found SO questions that were similar and tried to implement a few solutions:

Efficient reading of 800 GB XML file in Python 2.7

How to read large file, line by line in python

When I run the following code on a 17Gb file:

SCRIPT 1 (itertools):

#!/usr/bin/env python2

import sys
import string
import os
import itertools

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in iter(f):
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT 2 (fileinput):

#!/usr/bin/env python2

import sys
import string
import os
import fileinput 

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    for line in fileinput.input(['BigFile']):
        posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT 3 (for line):

#!/usr/bin/env python2

import sys
import string
import os

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in f:
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT 4 (yield):

#!/usr/bin/env python2

import sys
import string
import os

def readInChunks(fileObj, chunkSize=30):
    # NOTE: yields fixed-size chunks (30 bytes by default), not lines,
    # so this script is not strictly equivalent to the ones above.
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    f = open('BigFile')
    for chunk in readInChunks(f):
        posList.append(chunk.strip())
    f.close()
    sys.stdout.write(str(sys.getsizeof(posList)))

From the 17Gb file, the size of the final list in Python is ~5Gb [from sys.getsizeof()], but according to 'top' each script uses upwards of 43Gb of memory.

My question is: why does the memory usage rise so much higher than the size of the input file or the final list? If the final list is only 5Gb, and the 17Gb input file is being read line by line, why does the memory use of each script hit ~43Gb? Is there a better way to read in large files without memory leaks (if that's what they are)?

Many thanks.

EDIT:

Output from '/usr/bin/time -v python script3.py':

Command being timed: "python script3.py"
User time (seconds): 159.65
System time (seconds): 21.74
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:01.96
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 181246448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 10182731
Voluntary context switches: 315
Involuntary context switches: 16722
Swaps: 0
File system inputs: 33831512
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Output from top:

PID     USER    PR  NI  VIRT    RES     SHR     S   %CPU    %MEM    TIME+   COMMAND
15816   user    20  0   727m    609m    2032    R   76.8    0.5 0:02.31 python
15816   user    20  0   1541m   1.4g    2032    R   99.6    1.1 0:05.31 python
15816   user    20  0   2362m   2.2g    2032    R   99.6    1.7 0:08.31 python
15816   user    20  0   3194m   3.0g    2032    R   99.6    2.4 0:11.31 python
15816   user    20  0   4014m   3.8g    2032    R   99.6    3   0:14.31 python
15816   user    20  0   4795m   4.6g    2032    R   99.6    3.6 0:17.31 python
15816   user    20  0   5653m   5.3g    2032    R   99.6    4.2 0:20.31 python
15816   user    20  0   6457m   6.1g    2032    R   99.3    4.9 0:23.30 python
15816   user    20  0   7260m   6.9g    2032    R   99.6    5.5 0:26.30 python
15816   user    20  0   8085m   7.7g    2032    R   99.9    6.1 0:29.31 python
15816   user    20  0   8809m   8.5g    2032    R   99.6    6.7 0:32.31 python
15816   user    20  0   9645m   9.3g    2032    R   99.3    7.4 0:35.30 python
15816   user    20  0   10.3g   10g 2032    R   99.6    8   0:38.30 python
15816   user    20  0   11.1g   10g 2032    R   100 8.6 0:41.31 python
15816   user    20  0   11.8g   11g 2032    R   99.9    9.2 0:44.32 python
15816   user    20  0   12.7g   12g 2032    R   99.3    9.9 0:47.31 python
15816   user    20  0   13.4g   13g 2032    R   99.6    10.5    0:50.31 python
15816   user    20  0   14.3g   14g 2032    R   99.9    11.1    0:53.32 python
15816   user    20  0   15.0g   14g 2032    R   99.3    11.7    0:56.31 python
15816   user    20  0   15.9g   15g 2032    R   99.9    12.4    0:59.32 python
15816   user    20  0   16.6g   16g 2032    R   99.6    13  1:02.32 python
15816   user    20  0   17.3g   17g 2032    R   99.6    13.6    1:05.32 python
15816   user    20  0   18.2g   17g 2032    R   99.9    14.2    1:08.33 python
15816   user    20  0   18.9g   18g 2032    R   99.6    14.9    1:11.33 python
15816   user    20  0   19.9g   19g 2032    R   100 15.5    1:14.34 python
15816   user    20  0   20.6g   20g 2032    R   99.3    16.1    1:17.33 python
15816   user    20  0   21.3g   21g 2032    R   99.6    16.7    1:20.33 python
15816   user    20  0   22.3g   21g 2032    R   99.9    17.4    1:23.34 python
15816   user    20  0   23.0g   22g 2032    R   99.6    18  1:26.34 python
15816   user    20  0   23.7g   23g 2032    R   99.6    18.6    1:29.34 python
15816   user    20  0   24.4g   24g 2032    R   99.6    19.2    1:32.34 python
15816   user    20  0   25.4g   25g 2032    R   99.3    19.9    1:35.33 python
15816   user    20  0   26.1g   25g 2032    R   99.9    20.5    1:38.34 python
15816   user    20  0   26.8g   26g 2032    R   99.9    21.1    1:41.35 python
15816   user    20  0   27.4g   27g 2032    R   99.6    21.7    1:44.35 python
15816   user    20  0   28.5g   28g 2032    R   99.6    22.3    1:47.35 python
15816   user    20  0   29.2g   28g 2032    R   99.9    22.9    1:50.36 python
15816   user    20  0   29.9g   29g 2032    R   99.6    23.5    1:53.36 python
15816   user    20  0   30.5g   30g 2032    R   99.6    24.1    1:56.36 python
15816   user    20  0   31.6g   31g 2032    R   99.6    24.7    1:59.36 python
15816   user    20  0   32.3g   31g 2032    R   100 25.3    2:02.37 python
15816   user    20  0   33.0g   32g 2032    R   99.6    25.9    2:05.37 python
15816   user    20  0   33.7g   33g 2032    R   99.6    26.5    2:08.37 python
15816   user    20  0   34.3g   34g 2032    R   99.6    27.1    2:11.37 python
15816   user    20  0   35.5g   34g 2032    R   99.6    27.7    2:14.37 python
15816   user    20  0   36.2g   35g 2032    R   99.6    28.4    2:17.37 python
15816   user    20  0   36.9g   36g 2032    R   100 29  2:20.38 python
15816   user    20  0   37.5g   37g 2032    R   99.6    29.6    2:23.38 python
15816   user    20  0   38.2g   38g 2032    R   99.6    30.2    2:26.38 python
15816   user    20  0   38.9g   38g 2032    R   99.6    30.8    2:29.38 python
15816   user    20  0   40.1g   39g 2032    R   100 31.4    2:32.39 python
15816   user    20  0   40.8g   40g 2032    R   99.6    32  2:35.39 python
15816   user    20  0   41.5g   41g 2032    R   99.6    32.6    2:38.39 python
15816   user    20  0   42.2g   41g 2032    R   99.9    33.2    2:41.40 python
15816   user    20  0   42.8g   42g 2032    R   99.6    33.8    2:44.40 python
15816   user    20  0   43.4g   43g 2032    R   99.6    34.3    2:47.40 python
15816   user    20  0   43.4g   43g 2032    R   100 34.3    2:50.41 python
15816   user    20  0   38.6g   38g 2032    R   100 30.5    2:53.43 python
15816   user    20  0   24.9g   24g 2032    R   99.7    19.6    2:56.43 python
15816   user    20  0   12.0g   11g 2032    R   100 9.4 2:59.44 python

Edit 2:

For further clarification, here is an expansion of the issue. What I'm doing here is reading in a list of positions in a FASTA file (Contig1/1, Contig1/2, etc.). That is being converted to a dictionary full of N's via:

keys = posList
values = ['N'] * len(posList)
speciesDict = dict(zip(keys, values))
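
A side note on the construction above: dict.fromkeys builds the same mapping in one step, and in Python 2 it avoids materializing both the values list and the list of pairs that zip creates:

speciesDict = dict.fromkeys(posList, 'N')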

Then, I'm reading in pileup files for multiple species, again line-by-line (where the same problem will exist), and getting the final base call via:

with open(path+'/'+os.path.basename(path)+'.pileups', "r") as filein:
    for line in filein:
        splitline = line.split()
        if len(splitline) > 4:
            node, pos, ref, num, bases, qual = splitline
            loc = node + '/' + pos
            # getCleanList and getFinalBase_Pruned are user-defined helpers
            cleanBases = getCleanList(ref, bases)
            finalBase = getFinalBase_Pruned(cleanBases, minread, thresh)
            speciesDict[loc] = finalBase

Because the species-specific pileup files are not the same length, nor in the same order, I am creating the list as a 'common-garden' way to store the individual species data. If no data is available for a given site for a species, it gets an 'N' call. Otherwise, a base is assigned to the site in the dictionary.

The end result is a file for each species which is ordered and complete, from which I can do downstream analyses.

Because the line-by-line reading is eating up so much memory, reading in TWO large files will overload my resources, even though the final data structures are much smaller than the memory I expected to need (the size of the growing lists plus a single line of data being added at a time).

sys.getsizeof(posList) is not giving you what I think you think it is: it's telling you the size of the list object containing the lines; this does not include the size of the lines themselves. Below are some outputs from reading a roughly 3.5Gb file into a list on my system:

In [2]: lines = []

In [3]: with open('bigfile') as inf:
   ...:     for line in inf:
   ...:         lines.append(line)
   ...:
In [4]: len(lines)
Out[4]: 68318734

In [5]: sys.getsizeof(lines)
Out[5]: 603811872

In [6]: sum(len(l) for l in lines)
Out[6]: 3473926127

In [7]: sum(sys.getsizeof(l) for l in lines)
Out[7]: 6001719285

That's a bit over six billion bytes there; in top, my interpreter was using about 7.5Gb at this point.
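
Dividing the difference between those two sums by the line count gives the fixed per-string cost (continuing the same session; the exact figure is version- and platform-dependent):

In [8]: (6001719285 - 3473926127) / 68318734.0
Out[8]: 37.0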

Strings have considerable overhead: 37 bytes each, it looks like:

In [2]: sys.getsizeof('0'*10)
Out[2]: 47

In [3]: sys.getsizeof('0'*100)
Out[3]: 137

In [4]: sys.getsizeof('0'*1000)
Out[4]: 1037

So if your lines are relatively short, a large part of the memory use will be overhead. For a 17Gb file of short lines, that per-string overhead alone can plausibly account for the jump to ~43Gb.

While not directly addressing the question of why there is such a memory overhead, which @nathan-vērzemnieks answers, a highly efficient solution to your problem may be to use a Python bitarray:

https://pypi.python.org/pypi/bitarray

I have used this in the past to store the presence/absence of thousands of DNA motifs in 16S RNA from over 250,000 species from the SILVA database. It basically encodes your Y/N flag into a single bit, instead of using the overhead associated with storing the character Y or N.
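
A minimal sketch of that idea, assuming the bitarray package is installed; posIndex and hasCall are illustrative names not taken from the question, while posList is the question's list of positions. Each site's data/no-data flag then costs one bit rather than a full Python string:

from bitarray import bitarray

# Map each position key (e.g. 'Contig1/1') to a dense integer index.
posIndex = dict((pos, i) for i, pos in enumerate(posList))

# One bit per site, all initially False (no data, i.e. the 'N' case).
hasCall = bitarray(len(posList))
hasCall.setall(False)

# When a pileup line yields a call for a site 'loc', flip its bit.
hasCall[posIndex[loc]] = True

# Total cost is roughly len(posList)/8 bytes, versus tens of bytes
# per site when each flag is stored as a one-character string.

The actual base calls would still need storage of their own; the bitarray only replaces the per-site Y/N bookkeeping.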
