
Memory use while doing line-by-line reading of a large file into Python 2.7

I am working on a genomics project involving some large files (10-50Gb) which I want to read into Python 2.7 for processing. I do not need to read an entire file into memory; rather, I simply want to read each file line by line, do a small task, and continue.

I found SO questions that were similar and tried to implement a few solutions:

Efficient reading of 800 GB XML file in Python 2.7

How to read large file, line by line in python

When I run the following code on a 17Gb file:

SCRIPT 1 (itertools):

#!/usr/bin/env python2

import sys
import string
import os
import itertools

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in iter(f):
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT 2 (fileinput):

#!/usr/bin/env python2

import sys
import string
import os
import fileinput 

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    for line in fileinput.input(['BigFile']):
        posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT 3 (for line):

#!/usr/bin/env python2

import sys
import string
import os

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in f:
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT 4 (yield):

#!/usr/bin/env python2

import sys
import string
import os

def readInChunks(fileObj, chunkSize=30):
    # NOTE: yields fixed-size chunks (30 bytes by default), not lines,
    # so this script is not strictly equivalent to the ones above.
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    f = open('BigFile')
    for chunk in readInChunks(f):
        posList.append(chunk.strip())
    f.close()
    sys.stdout.write(str(sys.getsizeof(posList)))

From the 17Gb file, the size of the final list in Python is ~5Gb [from sys.getsizeof()], but according to 'top' each script uses upwards of 43Gb of memory.

My question is: why does the memory usage rise so much higher than the size of the input file or the final list? If the final list is only 5Gb, and the 17Gb input file is being read line by line, why does the memory use of each script hit ~43Gb? Is there a better way to read in large files without memory leaks (if that's what they are)?

Many thanks.

EDIT:

Output from '/usr/bin/time -v python script3.py':

Command being timed: "python script3.py"
User time (seconds): 159.65
System time (seconds): 21.74
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:01.96
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 181246448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 10182731
Voluntary context switches: 315
Involuntary context switches: 16722
Swaps: 0
File system inputs: 33831512
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Output from top:

PID     USER    PR  NI  VIRT    RES     SHR     S   %CPU    %MEM    TIME+   COMMAND
15816   user    20  0   727m    609m    2032    R   76.8    0.5 0:02.31 python
15816   user    20  0   1541m   1.4g    2032    R   99.6    1.1 0:05.31 python
15816   user    20  0   2362m   2.2g    2032    R   99.6    1.7 0:08.31 python
15816   user    20  0   3194m   3.0g    2032    R   99.6    2.4 0:11.31 python
15816   user    20  0   4014m   3.8g    2032    R   99.6    3   0:14.31 python
15816   user    20  0   4795m   4.6g    2032    R   99.6    3.6 0:17.31 python
15816   user    20  0   5653m   5.3g    2032    R   99.6    4.2 0:20.31 python
15816   user    20  0   6457m   6.1g    2032    R   99.3    4.9 0:23.30 python
15816   user    20  0   7260m   6.9g    2032    R   99.6    5.5 0:26.30 python
15816   user    20  0   8085m   7.7g    2032    R   99.9    6.1 0:29.31 python
15816   user    20  0   8809m   8.5g    2032    R   99.6    6.7 0:32.31 python
15816   user    20  0   9645m   9.3g    2032    R   99.3    7.4 0:35.30 python
15816   user    20  0   10.3g   10g 2032    R   99.6    8   0:38.30 python
15816   user    20  0   11.1g   10g 2032    R   100 8.6 0:41.31 python
15816   user    20  0   11.8g   11g 2032    R   99.9    9.2 0:44.32 python
15816   user    20  0   12.7g   12g 2032    R   99.3    9.9 0:47.31 python
15816   user    20  0   13.4g   13g 2032    R   99.6    10.5    0:50.31 python
15816   user    20  0   14.3g   14g 2032    R   99.9    11.1    0:53.32 python
15816   user    20  0   15.0g   14g 2032    R   99.3    11.7    0:56.31 python
15816   user    20  0   15.9g   15g 2032    R   99.9    12.4    0:59.32 python
15816   user    20  0   16.6g   16g 2032    R   99.6    13  1:02.32 python
15816   user    20  0   17.3g   17g 2032    R   99.6    13.6    1:05.32 python
15816   user    20  0   18.2g   17g 2032    R   99.9    14.2    1:08.33 python
15816   user    20  0   18.9g   18g 2032    R   99.6    14.9    1:11.33 python
15816   user    20  0   19.9g   19g 2032    R   100 15.5    1:14.34 python
15816   user    20  0   20.6g   20g 2032    R   99.3    16.1    1:17.33 python
15816   user    20  0   21.3g   21g 2032    R   99.6    16.7    1:20.33 python
15816   user    20  0   22.3g   21g 2032    R   99.9    17.4    1:23.34 python
15816   user    20  0   23.0g   22g 2032    R   99.6    18  1:26.34 python
15816   user    20  0   23.7g   23g 2032    R   99.6    18.6    1:29.34 python
15816   user    20  0   24.4g   24g 2032    R   99.6    19.2    1:32.34 python
15816   user    20  0   25.4g   25g 2032    R   99.3    19.9    1:35.33 python
15816   user    20  0   26.1g   25g 2032    R   99.9    20.5    1:38.34 python
15816   user    20  0   26.8g   26g 2032    R   99.9    21.1    1:41.35 python
15816   user    20  0   27.4g   27g 2032    R   99.6    21.7    1:44.35 python
15816   user    20  0   28.5g   28g 2032    R   99.6    22.3    1:47.35 python
15816   user    20  0   29.2g   28g 2032    R   99.9    22.9    1:50.36 python
15816   user    20  0   29.9g   29g 2032    R   99.6    23.5    1:53.36 python
15816   user    20  0   30.5g   30g 2032    R   99.6    24.1    1:56.36 python
15816   user    20  0   31.6g   31g 2032    R   99.6    24.7    1:59.36 python
15816   user    20  0   32.3g   31g 2032    R   100 25.3    2:02.37 python
15816   user    20  0   33.0g   32g 2032    R   99.6    25.9    2:05.37 python
15816   user    20  0   33.7g   33g 2032    R   99.6    26.5    2:08.37 python
15816   user    20  0   34.3g   34g 2032    R   99.6    27.1    2:11.37 python
15816   user    20  0   35.5g   34g 2032    R   99.6    27.7    2:14.37 python
15816   user    20  0   36.2g   35g 2032    R   99.6    28.4    2:17.37 python
15816   user    20  0   36.9g   36g 2032    R   100 29  2:20.38 python
15816   user    20  0   37.5g   37g 2032    R   99.6    29.6    2:23.38 python
15816   user    20  0   38.2g   38g 2032    R   99.6    30.2    2:26.38 python
15816   user    20  0   38.9g   38g 2032    R   99.6    30.8    2:29.38 python
15816   user    20  0   40.1g   39g 2032    R   100 31.4    2:32.39 python
15816   user    20  0   40.8g   40g 2032    R   99.6    32  2:35.39 python
15816   user    20  0   41.5g   41g 2032    R   99.6    32.6    2:38.39 python
15816   user    20  0   42.2g   41g 2032    R   99.9    33.2    2:41.40 python
15816   user    20  0   42.8g   42g 2032    R   99.6    33.8    2:44.40 python
15816   user    20  0   43.4g   43g 2032    R   99.6    34.3    2:47.40 python
15816   user    20  0   43.4g   43g 2032    R   100 34.3    2:50.41 python
15816   user    20  0   38.6g   38g 2032    R   100 30.5    2:53.43 python
15816   user    20  0   24.9g   24g 2032    R   99.7    19.6    2:56.43 python
15816   user    20  0   12.0g   11g 2032    R   100 9.4 2:59.44 python

Edit 2:

For further clarification, here is an expansion of the issue. What I'm doing here is reading in a list of positions in a FASTA file (Contig1/1, Contig1/2, etc.). That is being converted to a dictionary full of N's via:

keys = posList
values = ['N'] * len(posList)
speciesDict = dict(zip(keys, values))
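
A side note on the construction above: dict.fromkeys builds the same mapping in one step, and in Python 2 it avoids materializing both the values list and the list of pairs that zip creates:

speciesDict = dict.fromkeys(posList, 'N')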

Then, I'm reading in pileup files for multiple species, again line-by-line (where the same problem will exist), and getting the final base call via:

with open(path+'/'+os.path.basename(path)+'.pileups', "r") as filein:
    for line in filein:
        splitline = line.split()
        if len(splitline) > 4:
            node, pos, ref, num, bases, qual = splitline
            loc = node + '/' + pos
            # getCleanList and getFinalBase_Pruned are user-defined helpers
            cleanBases = getCleanList(ref, bases)
            finalBase = getFinalBase_Pruned(cleanBases, minread, thresh)
            speciesDict[loc] = finalBase

Because the species-specific pileup files are not the same length, nor in the same order, I am creating the list as a 'common-garden' way to store the individual species data. If no data is available for a given site for a species, it gets an 'N' call. Otherwise, a base is assigned to the site in the dictionary.

The end result is a file for each species which is ordered and complete, from which I can do downstream analyses.

Because the line-by-line reading is eating up so much memory, reading in TWO large files will overload my resources, even though the final data structures are much smaller than the memory I expected to need (the size of the growing lists plus a single line of data being added at a time).

sys.getsizeof(posList) is not giving you what I think you think it is: it's telling you the size of the list object containing the lines; this does not include the size of the lines themselves. Below are some outputs from reading a roughly 3.5Gb file into a list on my system:

In [2]: lines = []

In [3]: with open('bigfile') as inf:
   ...:     for line in inf:
   ...:         lines.append(line)
   ...:
In [4]: len(lines)
Out[4]: 68318734

In [5]: sys.getsizeof(lines)
Out[5]: 603811872

In [6]: sum(len(l) for l in lines)
Out[6]: 3473926127

In [7]: sum(sys.getsizeof(l) for l in lines)
Out[7]: 6001719285

That's a bit over six billion bytes there; in top, my interpreter was using about 7.5Gb at this point.
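
Dividing the difference between those two sums by the line count gives the fixed per-string cost (continuing the same session; the exact figure is version- and platform-dependent):

In [8]: (6001719285 - 3473926127) / 68318734.0
Out[8]: 37.0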

Strings have considerable overhead: 37 bytes each, it looks like:

In [2]: sys.getsizeof('0'*10)
Out[2]: 47

In [3]: sys.getsizeof('0'*100)
Out[3]: 137

In [4]: sys.getsizeof('0'*1000)
Out[4]: 1037

So if your lines are relatively short, a large part of the memory use will be overhead. For a 17Gb file of short lines, that per-string overhead alone can plausibly account for the jump to ~43Gb.

While not directly addressing the question of why there is such a memory overhead, which @nathan-vērzemnieks answers, a highly efficient solution to your problem may be to use a Python bitarray:

https://pypi.python.org/pypi/bitarray

I have used this in the past to store the presence/absence of thousands of DNA motifs in 16S RNA from over 250,000 species from the SILVA database. It basically encodes your Y/N flag into a single bit, instead of using the overhead associated with storing the character Y or N.
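
A minimal sketch of that idea, assuming the bitarray package is installed; posIndex and hasCall are illustrative names not taken from the question, while posList is the question's list of positions. Each site's data/no-data flag then costs one bit rather than a full Python string:

from bitarray import bitarray

# Map each position key (e.g. 'Contig1/1') to a dense integer index.
posIndex = dict((pos, i) for i, pos in enumerate(posList))

# One bit per site, all initially False (no data, i.e. the 'N' case).
hasCall = bitarray(len(posList))
hasCall.setall(False)

# When a pileup line yields a call for a site 'loc', flip its bit.
hasCall[posIndex[loc]] = True

# Total cost is roughly len(posList)/8 bytes, versus tens of bytes
# per site when each flag is stored as a one-character string.

The actual base calls would still need storage of their own; the bitarray only replaces the per-site Y/N bookkeeping.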
