
Likely memory leak in generator loop with islice

I am working with large files holding several million records each (approx. 2 GB unpacked, several hundred MB gzipped).

I iterate over the records with islice, which allows me to get either a small portion (for debugging and development) or the whole thing when I want to test the code. I have noticed an absurdly large memory usage for my code and thus I am trying to find the memory leak in it.
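For context, the pattern looks roughly like this (a minimal sketch; take_pairs is a hypothetical helper, and the toy iterators stand in for the real record streams):

from itertools import islice, izip

def take_pairs(read1, read2, nbrofitems=10**8):
    """Lazily yield at most nbrofitems paired records. islice and izip
    build no intermediate list, so this by itself should hold no memory."""
    return islice(izip(read1, read2), nbrofitems)

# Toy usage: nothing is read until the loop pulls a pair
for rec1, rec2 in take_pairs(iter("ACGT"), iter("TGCA"), nbrofitems=2):
    print rec1, rec2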

Below is the output from memory_profiler on a paired read (where I open two files and zip the records), for ONLY 10**5 values (the default value gets overridden):

Line #    Mem usage    Increment   Line Contents
================================================
   137   27.488 MiB    0.000 MiB   @profile
   138                             def paired_read(read1, read2, nbrofitems = 10**8):
   139                              """ Procedure for reading both sequences and stitching them together """
   140   27.488 MiB    0.000 MiB    seqFreqs = Counter()
   141   27.488 MiB    0.000 MiB    linker_str = "~"
   142                              #for rec1, rec2 in izip(read1, read2):
   143 3013.402 MiB 2985.914 MiB    for rec1, rec2 in islice(izip(read1, read2), nbrofitems):
   144 3013.398 MiB   -0.004 MiB        rec1 = rec1[9:]                         # Trim the primer variable sequence
   145 3013.398 MiB    0.000 MiB        rec2 = rec2[:150].reverse_complement()  # Trim the low quality half of the 3' read AND take rev complement
   146                                  #aaSeq = Seq.translate(rec1 + rec2)
   147                             
   148                                  global nseqs 
   149 3013.398 MiB    0.000 MiB        nseqs += 1
   150                             
   151 3013.402 MiB    0.004 MiB        if filter_seq(rec1, direction=5) and filter_seq(rec2, direction=3):
   152 3013.395 MiB   -0.008 MiB            aakey = str(Seq.translate(rec1)) + linker_str + str(Seq.translate(rec2))
   153 3013.395 MiB    0.000 MiB            seqFreqs.update({ aakey : 1 })  
   154                                  
   155 3013.402 MiB    0.008 MiB    print "========================================"
   156 3013.402 MiB    0.000 MiB    print "# of total sequences: %d" % nseqs
   157 3013.402 MiB    0.000 MiB    print "# of filtered sequences: %d" % sum(seqFreqs.values())
   158 3013.461 MiB    0.059 MiB    print "# of repeated occurances: %d" % (sum(seqFreqs.values()) - len(list(seqFreqs)))
   159 3013.461 MiB    0.000 MiB    print "# of low-score sequences (<20): %d" % lowQSeq
   160 3013.461 MiB    0.000 MiB    print "# of sequences with stop codon: %d" % starSeqs
   161 3013.461 MiB    0.000 MiB    print "========================================"
   162 3013.504 MiB    0.043 MiB    pprint(seqFreqs.most_common(100), width = 240)

The code, in short, does some filtering on the records and keeps track of how many times the strings occur in the file (a zipped pair of strings in this particular case).

100,000 strings of 150 chars with integer values in a Counter should land at around 100 MB tops, which I checked using the recursive getsizeof function by @AaronHall.
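That function isn't reproduced here, but a simplified sketch of the same idea (my own cut-down version, descending only into dicts and common sequence types) would be:

import sys
from numbers import Number

def total_size(obj, seen=None):
    """Recursively sum sys.getsizeof over an object and its members."""
    if seen is None:
        seen = set()
    if id(obj) in seen:                 # don't double-count shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, (str, unicode, Number)):
        return size                     # leaf values: nothing to descend into
    if isinstance(obj, dict):           # Counter is a dict subclass
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.iteritems())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

# e.g. total_size(seqFreqs) after the loop above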

Given the memory_profiler output, I suspect that islice doesn't let go of the previous entities over the course of the iteration. A Google search landed me at this bug report; however, it's marked as solved for Python 2.7, which is what I am running at the moment.

Any opinions?

EDIT: I have tried to skip islice as per the comment below and use a for loop like

for rec in list(next(read1) for _ in xrange(10**5)):

which makes no significant difference. This is in the single-file case, in order to avoid izip, which also comes from itertools.

A secondary troubleshooting idea I had was to check whether gzip.open() reads and expands the whole file into memory, which could cause the issue here. However, running the script on decompressed files doesn't make a difference.
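For what it's worth, a quick way to check that gzip streams lazily (the file name here is hypothetical, and the resource module is Unix-only):

import gzip
import resource

# Pull the first 10**5 lines lazily; if gzip.open expanded the whole
# file into memory up front, peak RSS would already be near the
# decompressed size by the time it is printed.
with gzip.open("reads_1.fastq.gz") as fh:
    for i, line in enumerate(fh):
        if i >= 10**5:
            break
print "peak RSS (kB, Linux): %d" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss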

Note that memory_profiler only reports the maximum memory consumption for each line. For long loops this can be misleading, as the first line of the loop always seems to report a disproportionate amount of memory.

That is because it compares the memory consumption at the first line of the loop with that of the line before it, which is outside the loop. It doesn't mean that the first line of the loop consumes 2985 MiB, but rather that the peak memory within the loop is 2985 MiB higher than outside the loop.
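A toy reproduction of that reporting artifact, assuming memory_profiler is installed:

from memory_profiler import profile

@profile
def grow():
    acc = []
    # The list grows across all iterations, yet the report attributes
    # nearly the whole increment to the `for` line: each line's figure
    # is the peak seen while that line executed, and the `for` line is
    # diffed against the (low) line just before the loop.
    for _ in xrange(10**5):
        acc.append("x" * 200)
    return acc

if __name__ == "__main__":
    grow()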
