Memory use while doing line-by-line reading of a large file into Python 2.7

StackOverflow,

I am working on a genomics project involving some large files (10-50Gb) which I want to read into Python 2.7 for processing. I do not need to read an entire file into memory; rather, I simply want to read each file line by line, do a small task, and continue on.

I found SO questions that were similar and tried to implement a few solutions:

Efficient reading of 800 GB XML file in Python 2.7

How to read large file, line by line in python

When I run the following code on a 17Gb file:

SCRIPT 1 (itertools):

#!/usr/bin/env python2

import sys
import string
import os
import itertools

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in iter(f):
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT 2 (fileinput):

#!/usr/bin/env python2

import sys
import string
import os
import fileinput 

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    for line in fileinput.input(['BigFile']):
        posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT3 (for line):

#!/usr/bin/env python2

import sys
import string
import os

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in f:
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT4 (yield):

#!/usr/bin/env python2

import sys
import string
import os

def readInChunks(fileObj, chunkSize=30):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    f = open('BigFile')
    for chunk in readInChunks(f):
        posList.append(chunk.strip())
    f.close()
    sys.stdout.write(str(sys.getsizeof(posList)))

From the 17Gb file, the size of the final list in Python is ~5Gb [from sys.getsizeof()], but according to 'top' each script uses upwards of 43Gb of memory.

My question is: Why does the memory usage rise so much higher than that of the input file or the final list? If the final list is only 5Gb, and the 17Gb file input is being read line-by-line, why does the memory use for each script hit ~43Gb? Is there a better way to read in large files without memory leaks (if that's what they are)?

Many thanks.

EDIT:

Output from '/usr/bin/time -v python script3.py':

Command being timed: "python script3.py"
User time (seconds): 159.65
System time (seconds): 21.74
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:01.96
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 181246448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 10182731
Voluntary context switches: 315
Involuntary context switches: 16722
Swaps: 0
File system inputs: 33831512
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Output from top:

15816   user    20  0   727m    609m    2032    R   76.8    0.5 0:02.31 python
15816   user    20  0   1541m   1.4g    2032    R   99.6    1.1 0:05.31 python
15816   user    20  0   2362m   2.2g    2032    R   99.6    1.7 0:08.31 python
15816   user    20  0   3194m   3.0g    2032    R   99.6    2.4 0:11.31 python
15816   user    20  0   4014m   3.8g    2032    R   99.6    3   0:14.31 python
15816   user    20  0   4795m   4.6g    2032    R   99.6    3.6 0:17.31 python
15816   user    20  0   5653m   5.3g    2032    R   99.6    4.2 0:20.31 python
15816   user    20  0   6457m   6.1g    2032    R   99.3    4.9 0:23.30 python
15816   user    20  0   7260m   6.9g    2032    R   99.6    5.5 0:26.30 python
15816   user    20  0   8085m   7.7g    2032    R   99.9    6.1 0:29.31 python
15816   user    20  0   8809m   8.5g    2032    R   99.6    6.7 0:32.31 python
15816   user    20  0   9645m   9.3g    2032    R   99.3    7.4 0:35.30 python
15816   user    20  0   10.3g   10g 2032    R   99.6    8   0:38.30 python
15816   user    20  0   11.1g   10g 2032    R   100 8.6 0:41.31 python
15816   user    20  0   11.8g   11g 2032    R   99.9    9.2 0:44.32 python
15816   user    20  0   12.7g   12g 2032    R   99.3    9.9 0:47.31 python
15816   user    20  0   13.4g   13g 2032    R   99.6    10.5    0:50.31 python
15816   user    20  0   14.3g   14g 2032    R   99.9    11.1    0:53.32 python
15816   user    20  0   15.0g   14g 2032    R   99.3    11.7    0:56.31 python
15816   user    20  0   15.9g   15g 2032    R   99.9    12.4    0:59.32 python
15816   user    20  0   16.6g   16g 2032    R   99.6    13  1:02.32 python
15816   user    20  0   17.3g   17g 2032    R   99.6    13.6    1:05.32 python
15816   user    20  0   18.2g   17g 2032    R   99.9    14.2    1:08.33 python
15816   user    20  0   18.9g   18g 2032    R   99.6    14.9    1:11.33 python
15816   user    20  0   19.9g   19g 2032    R   100 15.5    1:14.34 python
15816   user    20  0   20.6g   20g 2032    R   99.3    16.1    1:17.33 python
15816   user    20  0   21.3g   21g 2032    R   99.6    16.7    1:20.33 python
15816   user    20  0   22.3g   21g 2032    R   99.9    17.4    1:23.34 python
15816   user    20  0   23.0g   22g 2032    R   99.6    18  1:26.34 python
15816   user    20  0   23.7g   23g 2032    R   99.6    18.6    1:29.34 python
15816   user    20  0   24.4g   24g 2032    R   99.6    19.2    1:32.34 python
15816   user    20  0   25.4g   25g 2032    R   99.3    19.9    1:35.33 python
15816   user    20  0   26.1g   25g 2032    R   99.9    20.5    1:38.34 python
15816   user    20  0   26.8g   26g 2032    R   99.9    21.1    1:41.35 python
15816   user    20  0   27.4g   27g 2032    R   99.6    21.7    1:44.35 python
15816   user    20  0   28.5g   28g 2032    R   99.6    22.3    1:47.35 python
15816   user    20  0   29.2g   28g 2032    R   99.9    22.9    1:50.36 python
15816   user    20  0   29.9g   29g 2032    R   99.6    23.5    1:53.36 python
15816   user    20  0   30.5g   30g 2032    R   99.6    24.1    1:56.36 python
15816   user    20  0   31.6g   31g 2032    R   99.6    24.7    1:59.36 python
15816   user    20  0   32.3g   31g 2032    R   100 25.3    2:02.37 python
15816   user    20  0   33.0g   32g 2032    R   99.6    25.9    2:05.37 python
15816   user    20  0   33.7g   33g 2032    R   99.6    26.5    2:08.37 python
15816   user    20  0   34.3g   34g 2032    R   99.6    27.1    2:11.37 python
15816   user    20  0   35.5g   34g 2032    R   99.6    27.7    2:14.37 python
15816   user    20  0   36.2g   35g 2032    R   99.6    28.4    2:17.37 python
15816   user    20  0   36.9g   36g 2032    R   100 29  2:20.38 python
15816   user    20  0   37.5g   37g 2032    R   99.6    29.6    2:23.38 python
15816   user    20  0   38.2g   38g 2032    R   99.6    30.2    2:26.38 python
15816   user    20  0   38.9g   38g 2032    R   99.6    30.8    2:29.38 python
15816   user    20  0   40.1g   39g 2032    R   100 31.4    2:32.39 python
15816   user    20  0   40.8g   40g 2032    R   99.6    32  2:35.39 python
15816   user    20  0   41.5g   41g 2032    R   99.6    32.6    2:38.39 python
15816   user    20  0   42.2g   41g 2032    R   99.9    33.2    2:41.40 python
15816   user    20  0   42.8g   42g 2032    R   99.6    33.8    2:44.40 python
15816   user    20  0   43.4g   43g 2032    R   99.6    34.3    2:47.40 python
15816   user    20  0   43.4g   43g 2032    R   100 34.3    2:50.41 python
15816   user    20  0   38.6g   38g 2032    R   100 30.5    2:53.43 python
15816   user    20  0   24.9g   24g 2032    R   99.7    19.6    2:56.43 python
15816   user    20  0   12.0g   11g 2032    R   100 9.4 2:59.44 python

Edit 2:

For further clarification, here is an expansion of the issue. What I'm doing here is reading in a list of positions in a FASTA file (Contig1/1, Contig1/2, etc.). That is being converted to a dictionary full of N's via:

keys = posList
values = ['N'] * len(posList)
speciesDict = dict(zip(keys, values))

Then, I'm reading in pileup files for multiple species, again line-by-line (where the same problem will exist), and getting the final base call via:

with open(path+'/'+os.path.basename(path)+'.pileups', "r") as filein:
    for line in filein:
        splitline = line.split()
        if len(splitline) > 4:
            node, pos, ref, num, bases, qual = splitline
            loc = node+'/'+pos
            cleanBases = getCleanList(ref, bases)
            finalBase = getFinalBase_Pruned(cleanBases, minread, thresh)
            speciesDict[loc] = finalBase

Because the species-specific pileup files are not the same length, or in the same order, I am creating the list to provide a 'common-garden' way to store the individual species data. If no data is available for a given site for a species, it gets an 'N' call. Otherwise, a base is assigned to the site in the dictionary.

The end result is a file for each species which is ordered and complete, from which I can do downstream analyses.
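For illustration, the final write-out for one species could look something like this sketch (the output filename here is just a placeholder):

with open(os.path.basename(path)+'.calls', 'w') as fileout:
    # walk the positions in their original posList order so every species file lines up
    for loc in posList:
        fileout.write(loc+'\t'+speciesDict[loc]+'\n')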

Because the line-by-line reading is eating up so much memory, reading in TWO large files will overload my resources, even though the final data structures are much smaller than the memory actually being consumed; I expected to need only the size of the growing lists plus a single line of data at a time.

sys.getsizeof(posList) is not giving you what I think you think it is: it's telling you the size of the list object containing the lines; this does not include the size of the lines themselves. Below are some outputs from reading a roughly 3.5Gb file into a list on my system:

In [2]: lines = []

In [3]: with open('bigfile') as inf:
   ...:     for line in inf:
   ...:         lines.append(line)
   ...:
In [4]: len(lines)
Out[4]: 68318734

In [5]: sys.getsizeof(lines)
Out[5]: 603811872

In [6]: sum(len(l) for l in lines)
Out[6]: 3473926127

In [7]: sum(sys.getsizeof(l) for l in lines)
Out[7]: 6001719285

That's a bit over six billion bytes; in top, my interpreter was using about 7.5Gb at that point.

Strings have considerable overhead: 37 bytes each, it looks like:

In [2]: sys.getsizeof('0'*10)
Out[2]: 47

In [3]: sys.getsizeof('0'*100)
Out[3]: 137

In [4]: sys.getsizeof('0'*1000)
Out[4]: 1037

So if your lines are relatively short, a large part of the memory use will be overhead.
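To see how that overhead scales, here is a rough back-of-the-envelope sketch; the average line length is an assumption, not measured from the actual file:

import sys

avg_line_len = 30                         # assumed average line length in bytes (hypothetical)
n_lines = (17 * 1024**3) // avg_line_len  # roughly how many lines fit in a ~17Gb file

str_overhead = sys.getsizeof('')          # per-string overhead, ~37 bytes on 64-bit CPython 2.7
ptr_size = 8                              # one 8-byte list slot per line

est_bytes = n_lines * (avg_line_len + str_overhead + ptr_size)
print('estimated footprint: %.1f Gb' % (est_bytes / float(1024**3)))

With short lines, the per-string overhead plus the list's pointer array can easily dwarf the raw file size.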

While this does not directly address the question of why there is such a memory overhead, which @nathan-vērzemnieks answers, a highly efficient solution to your problem may be to use a Python bitarray:

https://pypi.python.org/pypi/bitarray

I have used this in the past to store the presence/absence of thousands of DNA motifs in 16S RNA from over 250,000 species from the SILVA database. It basically encodes your Y/N flag in a single bit, instead of incurring the overhead of storing the character Y or N.
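As a rough illustration of the idea (the number of sites below is made up), one bit per position looks like this:

from bitarray import bitarray

n_sites = 600 * 10**6         # hypothetical number of positions to track
flags = bitarray(n_sites)     # one bit per site instead of a one-character string
flags.setall(False)           # every site starts as 'no data' (the 'N' case)

flags[12345] = True           # mark a site that received a real base call

print(flags.count())                           # number of sites with data
print('%d MB' % (len(flags) // 8 // 1024**2))  # ~71 MB for 600 million sites

Compared with storing those flags as Python objects in a list or dict, the per-entry cost drops from tens of bytes to a single bit.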
