Possible memory leak in a generator loop with islice

Question

I am working with large files that each hold several million records (roughly 2 GB uncompressed, a few hundred MB gzipped).

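I iterate over the records with islice, which lets me take either a small portion (for debugging and development) or the whole file when I want to test the code. I noticed that my code uses a very large amount of memory, so I am trying to find a memory leak in it.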
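For readers unfamiliar with the pattern, here is a minimal sketch of taking a limited number of records from a lazy source with islice. It is written for Python 3 (the question uses Python 2.7), and `records` and `take` are hypothetical stand-ins, not the parser or helpers from the question.

```python
from itertools import islice

def records(n):
    """Hypothetical stand-in for a record parser: yields records lazily."""
    for i in range(n):
        yield "record-%d" % i

def take(iterable, limit):
    """Return the first `limit` items without consuming any more than that."""
    return list(islice(iterable, limit))

source = records(10**6)
subset = take(source, 5)
print(subset)

# islice stops pulling from the source once the limit is reached,
# so the generator resumes right after the slice.
following = next(source)
print(following)
```

Passing `None` as the limit makes islice run through the whole source, which matches the "small portion or the whole file" use described above.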

Below is the memory_profiler output for the paired read (I open two files and zip the records), with only 10**5 values (the default value is overridden).

Line #    Mem usage    Increment   Line Contents
================================================
   137   27.488 MiB    0.000 MiB   @profile
   138                             def paired_read(read1, read2, nbrofitems = 10**8):
   139                              """ Procedure for reading both sequences and stitching them together """
   140   27.488 MiB    0.000 MiB    seqFreqs = Counter()
   141   27.488 MiB    0.000 MiB    linker_str = "~"
   142                              #for rec1, rec2 in izip(read1, read2):
   143 3013.402 MiB 2985.914 MiB    for rec1, rec2 in islice(izip(read1, read2), nbrofitems):
   144 3013.398 MiB   -0.004 MiB        rec1 = rec1[9:]                         # Trim the primer variable sequence
   145 3013.398 MiB    0.000 MiB        rec2 = rec2[:150].reverse_complement()  # Trim the low quality half of the 3' read AND take rev complement
   146                                  #aaSeq = Seq.translate(rec1 + rec2)
   147                             
   148                                  global nseqs 
   149 3013.398 MiB    0.000 MiB        nseqs += 1
   150                             
   151 3013.402 MiB    0.004 MiB        if filter_seq(rec1, direction=5) and filter_seq(rec2, direction=3):
   152 3013.395 MiB   -0.008 MiB            aakey = str(Seq.translate(rec1)) + linker_str + str(Seq.translate(rec2))
   153 3013.395 MiB    0.000 MiB            seqFreqs.update({ aakey : 1 })  
   154                                  
   155 3013.402 MiB    0.008 MiB    print "========================================"
   156 3013.402 MiB    0.000 MiB    print "# of total sequences: %d" % nseqs
   157 3013.402 MiB    0.000 MiB    print "# of filtered sequences: %d" % sum(seqFreqs.values())
   158 3013.461 MiB    0.059 MiB    print "# of repeated occurances: %d" % (sum(seqFreqs.values()) - len(list(seqFreqs)))
   159 3013.461 MiB    0.000 MiB    print "# of low-score sequences (<20): %d" % lowQSeq
   160 3013.461 MiB    0.000 MiB    print "# of sequences with stop codon: %d" % starSeqs
   161 3013.461 MiB    0.000 MiB    print "========================================"
   162 3013.504 MiB    0.043 MiB    pprint(seqFreqs.most_common(100), width = 240)

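In short, the code filters the records and keeps track of how many times each string occurs in the file (in this particular case, a zipped pair of strings).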

About 10**5 strings of 150 characters, with integer counts, in a Counter should come to roughly 100 MB at most; I checked this with a function by @AaronHall.
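The function by @AaronHall is not reproduced in this post. As a rough illustration of that kind of check, the sketch below sums `sys.getsizeof` over a Counter and its keys and values. `approx_counter_size` is a hypothetical helper, the data is synthetic, and the code targets Python 3, so the numbers will differ from those on Python 2.7.

```python
import sys
from collections import Counter

def approx_counter_size(counter):
    """Rough size in bytes: the dict itself plus every key and value object.

    Objects shared between entries (such as small integers) are counted
    once per entry, so this tends to overestimate.
    """
    total = sys.getsizeof(counter)
    for key, value in counter.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)
    return total

# Synthetic data (an assumption): 10**5 distinct keys of 150 characters each.
counts = Counter()
for i in range(10**5):
    key = ("%06d" % i).ljust(150, "A")
    counts[key] += 1

size_mib = approx_counter_size(counts) / 2.0**20
print("approx. %.1f MiB" % size_mib)
```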

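Given the memory_profiler output, I suspect that islice does not release earlier items during iteration. A Google search led me to this bug report, but it is marked as resolved for Python 2.7, which is what I am currently running.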

Any thoughts?

Edit: Following the comments below, I tried skipping islice and using a for loop like this:

for rec in list(next(read1) for _ in xrange(10**5)):

This made no significant difference. I tried it on a single file, so that izip (also from itertools) was out of the picture as well.

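My second troubleshooting idea was to check whether gzip.open() reads and expands the whole file into memory, thereby causing the problem. However, running the script on decompressed files made no difference.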
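For reference, a small sketch of reading only the first few lines of a gzip file through `gzip.open()`. As far as I know the handle decompresses incrementally as it is iterated, which is consistent with the observation above that decompressed input made no difference. The file name and contents are synthetic, and the code targets Python 3.

```python
import gzip
import os
import tempfile
from itertools import islice

# Write a small gzip file to a temporary directory (synthetic data).
path = os.path.join(tempfile.mkdtemp(), "reads.txt.gz")
with gzip.open(path, "wt") as handle:
    for i in range(1000):
        handle.write("line-%d\n" % i)

# Iterating the handle yields one line at a time; islice stops after three.
with gzip.open(path, "rt") as handle:
    first = [line.rstrip("\n") for line in islice(handle, 3)]

print(first)
```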

Answer 1

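Note that memory_profiler only reports the maximum memory consumption for each line. For long loops this can be misleading, because the first line of the loop always seems to report a disproportionate amount of memory.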

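This is because it compares the first line of the loop with the memory consumption of the line before it, which is outside the loop. It does not mean that the first line of the loop consumes 2985 MiB, but rather that the peak memory inside the loop is 2985 MiB higher than the memory outside the loop.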
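To see the difference between a peak inside a loop and memory that is still held afterwards, one option is `tracemalloc` from the Python 3 standard library. This is a substitute technique, not memory_profiler, and it measures Python-level allocations rather than process memory. The record source below is hypothetical, and the sketch makes no claim about where the 2985 MiB in the question came from.

```python
import tracemalloc
from itertools import islice

def big_records():
    """Hypothetical record source: each record is a fresh ~100 kB bytes object."""
    while True:
        yield bytes(100000)

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

total = 0
for rec in islice(big_records(), 200):
    total += len(rec)  # the previous record becomes garbage on each iteration

rec = None  # drop the reference to the last record
after, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

retained = after - before
peak_delta = peak - before
print("retained after loop: %d bytes" % retained)
print("peak during loop:    %d bytes" % peak_delta)
```

In this sketch the peak reflects the records alive at one moment inside the loop, while the retained figure is what remains once the loop has finished. A per-line report of maxima would show the former, not the latter.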

Answer 1 (3, accepted, 2016-03-18 12:20:54)
