Why does this lxml.etree.HTMLPullParser leak memory?

Question

I'm trying to use lxml's HTMLPullParser on Linux Mint, but memory usage keeps increasing and I'm not sure why. Here's my test code:

# -*- coding: utf-8 -*-
from __future__ import division, absolute_import, print_function, unicode_literals
import lxml.etree
import resource
from io import DEFAULT_BUFFER_SIZE

for _ in xrange(1000):
    with open('StackOverflow.html', 'r') as f:
        parser = lxml.etree.HTMLPullParser()
        # Feed the file to the parser one buffer at a time
        while True:
            buf = f.read(DEFAULT_BUFFER_SIZE)
            if not buf:
                break
            parser.feed(buf)
        parser.close()

    # Print peak memory usage in MB (ru_maxrss is reported in kilobytes on Linux)
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0)

StackOverflow.html is the Stack Overflow homepage, saved in the same folder as the Python script. I've tried adding explicit deletes and clears, but so far nothing has worked. What am I doing wrong?

Answer 1

Elements constructed by the parser are leaking, and I can't see an API contract violation in your code that would cause it. Since the objects survive a manual garbage collection run with gc.collect(), your best bet is probably to try a different parsing strategy as a workaround.
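Why surviving gc.collect() matters can be shown with a minimal, standard-library sketch. It does not use lxml; Node, live_count and make_cycle are hypothetical names invented for this illustration. Objects caught only in a reference cycle disappear after a collection, while objects that something still references remain, which is the pattern the leaked elements show:

```python
import gc


class Node(object):
    """Hypothetical stand-in for an element built by a parser."""

    def __init__(self):
        self.partner = None


def live_count(cls):
    """Count instances of cls currently tracked by the garbage collector."""
    return sum(1 for obj in gc.get_objects() if type(obj) is cls)


def make_cycle():
    """Create two nodes that reference each other, then drop them."""
    a, b = Node(), Node()
    a.partner, b.partner = b, a


gc.disable()  # keep automatic collection from interfering with the counts
try:
    make_cycle()
    with_cycle = live_count(Node)     # the unreachable cycle is still in memory

    gc.collect()
    after_collect = live_count(Node)  # the cycle has been reclaimed

    held = [Node() for _ in range(3)]  # these stay referenced by a list
    gc.collect()
    still_held = live_count(Node)     # a collection cannot free them
finally:
    gc.enable()

print(with_cycle, after_collect, still_held)
```

If a count stays high after gc.collect(), the objects are not plain cyclic garbage: something, possibly inside an extension module, still holds a reference to them.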

To find the root cause, I used the memory exploration module objgraph and installed xdot to view the graphs it creates.

Before running the code, I ran:

In [3]: import objgraph

In [4]: objgraph.show_growth()

After running the code, I ran:

In [6]: objgraph.show_growth()
tuple                  1616      +147
_Element                146      +146
list                   1100       +24
wrapper_descriptor     1423       +15
weakref                1155        +6
getset_descriptor       677        +4
dict                   2777        +4
member_descriptor       315        +3
method_descriptor       891        +2
_TempStore                2        +1

In [7]: import random

In [8]: objgraph.show_chain(
   ...: objgraph.find_backref_chain(
   ...: random.choice(objgraph.by_type('_Element')), objgraph.is_proper_module))
Graph written to /tmp/objgraph-bfuwa9.dot (8 nodes)
Spawning graph viewer (xdot)

Note: the numbers may differ from what you see, depending on the web page used.

Answer 1 (accepted), 2015-04-29 17:02:13