Why is this lxml.etree.HTMLPullParser leaking memory?

I'm trying to use lxml's HTMLPullParser on Linux Mint, but memory usage keeps increasing and I'm not sure why. Here's my test code:

# -*- coding: utf-8 -*-
from __future__ import division, absolute_import, print_function, unicode_literals
import lxml.etree
import resource
from io import DEFAULT_BUFFER_SIZE

for _ in xrange(1000):
    with open('StackOverflow.html', 'r') as f:
        parser = lxml.etree.HTMLPullParser()
        while True:
            buf = f.read(DEFAULT_BUFFER_SIZE)
            if not buf: break
            parser.feed(buf)
        parser.close()

    # Print memory usage
    print((resource.getrusage(resource.RUSAGE_SELF)[2] * resource.getpagesize())/1000000.0)

StackOverflow.html is the Stack Overflow homepage, saved in the same folder as the Python script. I've tried adding explicit deletes and clears, but so far nothing has worked. What am I doing wrong?
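A side note on the measurement itself: on Linux, ru_maxrss (index 2 of the getrusage result) is, as far as I know, reported in kilobytes rather than pages, so multiplying it by the page size probably overstates the absolute figure, although the upward trend is still visible. Below is a minimal sketch of a helper for this. The name peak_rss_mb is hypothetical, and the units (kilobytes on Linux, bytes on macOS) are assumptions worth checking against your platform's getrusage man page.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MB (10**6 bytes)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Assumption: macOS reports ru_maxrss in bytes, Linux in kilobytes.
    bytes_per_unit = 1 if sys.platform == 'darwin' else 1024
    return peak * bytes_per_unit / 1e6


if __name__ == '__main__':
    print(peak_rss_mb())
```

Keep in mind that this is a peak value: it never goes down, so it can show that memory grew but not that memory was later released.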

Elements constructed by the parser are leaking, and I can't see an API contract violation in your code that would cause it. Since the objects survive a manual garbage collection run with gc.collect(), your best bet is probably to try a different parsing strategy as a workaround.
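The gc.collect() check can be done without extra dependencies. The sketch below is not the exact procedure used in this answer. The helper count_live is a hypothetical name; it forces a full collection and then counts the gc-tracked objects whose type has a given name. Objects that are only cyclic garbage vanish after the collection, while objects something still holds on to remain.

```python
import gc


def count_live(type_name):
    """Count gc-tracked objects of the named type after a full collection."""
    gc.collect()  # free unreachable reference cycles first
    return sum(1 for obj in gc.get_objects()
               if type(obj).__name__ == type_name)


class Node(object):
    """Stand-in class for the demo; lxml is not needed to run this."""


if __name__ == '__main__':
    baseline = count_live('Node')

    cyclic = Node()
    cyclic.me = cyclic       # a reference cycle, reclaimable by the collector
    del cyclic
    print(count_live('Node') - baseline)   # cycle is gone after collection

    held = [Node() for _ in range(3)]      # still referenced, so it survives
    print(count_live('Node') - baseline)
```

Calling count_live('_Element') after each iteration of the loop in the question would show whether the count keeps climbing despite the collection.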

To see the root cause, I used the memory exploration module objgraph and installed xdot to view the graphs it created.

Before running the code, I ran:

In [3]: import objgraph

In [4]: objgraph.show_growth()

After running the code, I ran:

In [6]: objgraph.show_growth()
tuple                  1616      +147
_Element                146      +146
list                   1100       +24
wrapper_descriptor     1423       +15
weakref                1155        +6
getset_descriptor       677        +4
dict                   2777        +4
member_descriptor       315        +3
method_descriptor       891        +2
_TempStore                2        +1

In [7]: import random

In [8]: objgraph.show_chain(
   ...: objgraph.find_backref_chain(
   ...: random.choice(objgraph.by_type('_Element')), objgraph.is_proper_module))
Graph written to /tmp/objgraph-bfuwa9.dot (8 nodes)
Spawning graph viewer (xdot)

Note: the numbers may differ from what you see, depending on the web page parsed.
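If installing objgraph is not an option, a rough approximation of the growth table above can be built from the standard library. This is only a sketch of the idea and not objgraph's actual implementation. The names type_counts and growth are hypothetical, and it compares two snapshots rather than tracking peaks.

```python
import gc
from collections import Counter


def type_counts():
    """Snapshot the number of gc-tracked objects per type name."""
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, limit=10):
    """Return (type_name, count, delta) rows for types that grew, largest delta first."""
    rows = [(name, count, count - before.get(name, 0))
            for name, count in after.items()
            if count > before.get(name, 0)]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


if __name__ == '__main__':
    before = type_counts()
    retained = [bytearray(8) for _ in range(100)]  # stand-in for the code under test
    after = type_counts()
    for name, count, delta in growth(before, after):
        print('%-24s %8d %+8d' % (name, count, delta))
```

Take one snapshot before the parsing loop and one after it; a type such as _Element appearing near the top of the rows would match what the session above shows.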
