
Likely memory leak in generator loop with islice

I am working with large files, each holding several million records (roughly 2 GB unpacked, several hundred MB gzipped).

I iterate over the records with islice, which lets me process either a small portion (for debugging and development) or the whole file when I want to test the code at full scale. I have noticed absurdly large memory usage and am trying to find the leak.
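For illustration, the pattern looks like this (a minimal sketch with a dummy record generator standing in for the real file parser):

from itertools import islice

def records():
    # stand-in for a lazy record iterator (the real one parses the input file)
    n = 0
    while True:
        n += 1
        yield "record-%d" % n

limit = 10 ** 5                       # small slice for debugging; raise it for full runs
count = 0
for rec in islice(records(), limit):  # islice stops after `limit` items
    count += 1                        # placeholder for the real per-record work
print count                           # -> 100000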

Below is the output from memory_profiler on a paired read (where I open two files and zip their records), for ONLY 10**5 values (the default value gets overwritten).

Line #    Mem usage    Increment   Line Contents
================================================
   137   27.488 MiB    0.000 MiB   @profile
   138                             def paired_read(read1, read2, nbrofitems = 10**8):
   139                              """ Procedure for reading both sequences and stitching them together """
   140   27.488 MiB    0.000 MiB    seqFreqs = Counter()
   141   27.488 MiB    0.000 MiB    linker_str = "~"
   142                              #for rec1, rec2 in izip(read1, read2):
   143 3013.402 MiB 2985.914 MiB    for rec1, rec2 in islice(izip(read1, read2), nbrofitems):
   144 3013.398 MiB   -0.004 MiB        rec1 = rec1[9:]                         # Trim the primer variable sequence
   145 3013.398 MiB    0.000 MiB        rec2 = rec2[:150].reverse_complement()  # Trim the low quality half of the 3' read AND take rev complement
   146                                  #aaSeq = Seq.translate(rec1 + rec2)
   147                             
   148                                  global nseqs 
   149 3013.398 MiB    0.000 MiB        nseqs += 1
   150                             
   151 3013.402 MiB    0.004 MiB        if filter_seq(rec1, direction=5) and filter_seq(rec2, direction=3):
   152 3013.395 MiB   -0.008 MiB            aakey = str(Seq.translate(rec1)) + linker_str + str(Seq.translate(rec2))
   153 3013.395 MiB    0.000 MiB            seqFreqs.update({ aakey : 1 })  
   154                                  
   155 3013.402 MiB    0.008 MiB    print "========================================"
   156 3013.402 MiB    0.000 MiB    print "# of total sequences: %d" % nseqs
   157 3013.402 MiB    0.000 MiB    print "# of filtered sequences: %d" % sum(seqFreqs.values())
   158 3013.461 MiB    0.059 MiB    print "# of repeated occurances: %d" % (sum(seqFreqs.values()) - len(list(seqFreqs)))
   159 3013.461 MiB    0.000 MiB    print "# of low-score sequences (<20): %d" % lowQSeq
   160 3013.461 MiB    0.000 MiB    print "# of sequences with stop codon: %d" % starSeqs
   161 3013.461 MiB    0.000 MiB    print "========================================"
   162 3013.504 MiB    0.043 MiB    pprint(seqFreqs.most_common(100), width = 240)

The code, in short, does some filtering on the records and keeps track of how many times each string occurs in the file (a zipped pair of strings in this particular case).

100,000 strings of 150 chars each, stored with integer values in a Counter, should land around 100 MB tops, which I checked using a recursive size function by @AaronHall (not reproduced here).
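A minimal sketch of such a recursive estimator (my reconstruction, not necessarily @AaronHall's exact helper):

import sys
from collections import Counter

def deep_getsizeof(obj, seen=None):
    # recursively sum sys.getsizeof over an object and its members,
    # tracking ids already visited so shared objects aren't counted twice
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):  # Counter is a dict subclass
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.iteritems())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

counts = Counter({"A" * 150: 3, "C" * 150: 1})
print deep_getsizeof(counts)  # rough byte total for keys, counts and dict overhead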

Given the memory_profiler output, I suspect that islice doesn't let go of the previous entries over the course of the iteration. A Google search landed me at this bug report; however, it's marked as resolved for Python 2.7, which is what I am running at the moment.
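For what it's worth, a quick toy check (not the original code; it relies on CPython's reference counting) suggests islice itself keeps no references to already-consumed items:

import weakref
from itertools import islice

class Rec(object):
    pass

def stream():
    while True:
        yield Rec()

refs = []
for rec in islice(stream(), 3):
    refs.append(weakref.ref(rec))  # weak refs don't keep the records alive
del rec
# if islice retained consumed items, some weak refs would still resolve
print [r() for r in refs]          # -> [None, None, None] on CPython 2.7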

Any opinions?

EDIT: I have tried skipping islice, as per a comment below, and using a for loop like

for rec in list(next(read1) for _ in xrange(10**5)):

which makes no significant difference. This test was done on a single file, in order to also avoid izip, which likewise comes from itertools.

A secondary troubleshooting idea I had was to check whether gzip.open() reads and expands the whole file into memory, causing the issue here. However, running the script on decompressed files makes no difference.
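(One toy way to convince oneself that gzip.open() decompresses on demand rather than inflating the whole file up front; the scratch file name is made up:)

import gzip, os

path = "demo.txt.gz"                  # hypothetical scratch file
with gzip.open(path, "wb") as out:    # write a small compressed file
    for i in xrange(1000):
        out.write("line %d\n" % i)

with gzip.open(path, "rb") as handle:
    print handle.readline().strip()   # inflates only enough data for one line
os.remove(path)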

Note that memory_profiler only reports the maximum memory consumption for each line. For long loops this can be misleading, as the first line of the loop always seems to report a disproportionately large amount of memory.

That is because it compares the memory consumption at the first line of the loop with that of the line before it, which sits outside the loop. It doesn't mean that the first line of the loop consumes 2985 MiB, but rather that the peak memory reached inside the loop is 2985 MiB higher than it was outside the loop.
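A toy sketch of the effect (assuming memory_profiler is installed; this is not the code from the question):

from memory_profiler import profile

@profile
def churn():
    total = 0
    # each pass builds and discards a ~80 MB list; nothing accumulates,
    # yet the "for" line is measured while the previous iteration's
    # "chunk" is still alive, so it reports the in-loop peak
    for i in xrange(5):
        chunk = [0] * (10 ** 7)
        total += len(chunk)
    return total

if __name__ == "__main__":
    churn()

Here the "for" line shows an increment of roughly 80 MiB over the pre-loop baseline even though steady-state usage never grows, which is the same pattern seen in the profile above.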
