
Twisted getPage(): process memory grows when requesting a lot of pages

I am writing a script for constant (every 30-120 seconds) grabbing of information by querying a large set of URLs (Icecast/Shoutcast server status pages), about 500 URLs. It works fine, but the resident size of the Python process keeps growing. I am sure the growth is unbounded, because I left it running for several hours and it went from an initial 30 MB to 1.2 GB RES.

I simplified the script to the following to make it easy to follow:

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.enterprise import adbapi

def ok(res, url):
    # Page fetched successfully; schedule the next fetch of the same URL.
    print "OK: " + str(url)
    reactor.callLater(30, load, url)

def error(res, url):
    # Fetch failed; reschedule anyway.
    print "FAIL: " + str(url)
    reactor.callLater(30, load, url)

def db_ok(res):
    # Start the fetch loop for every URL returned by the database query.
    for item in res:
        if item[1]:
            print "ADDED: " + str(item[1])
            reactor.callLater(30, load, item[1])

def db_error(res):
    print "Database error: " + str(res)
    reactor.stop()

def load(url):
    d = getPage(url,
                headers={"Accept": "text/html"},
                timeout=30)
    d.addCallback(ok, url)
    d.addErrback(error, url)


dbpool = adbapi.ConnectionPool("MySQLdb", "host", "user", "passwd", db="db")
q = dbpool.runQuery("SELECT id, url FROM stations")
q.addCallback(db_ok).addErrback(db_error)

reactor.run()

It grows just like the original daemon does, so I have narrowed the problem down to this. I think it is somehow related to twisted.web.client.getPage(). In the original daemon I used twisted.manhole to do heap evaluations with meliae at run time, but I do not see anything nasty there.
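(For reference, a dump like the ones below can be taken and summarized with meliae roughly as in this minimal sketch; the file name and the way the dump is triggered are illustrative, not part of the original daemon.)

from meliae import scanner, loader

def dump_heap(path="heap.json"):
    # Walk all live objects in the process and write them to a JSON dump.
    scanner.dump_all_objects(path)

def summarize_heap(path="heap.json"):
    # Load the dump back in and print the per-type summary table.
    om = loader.load(path)
    print om.summarize()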

The first meliae dump, taken right after only 1 or 2 query cycles have finished:

Total 84313 objects, 188 types, Total size = 15.9MiB (16647235 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0    5806   6   4142800  24  24  786712 dict
     1   28070  33   2223457  13  38    4874 str
     2     612   0   1636992   9  48    3424 HTTPClientFactory
     3   19599  23   1585720   9  57     608 tuple
     4     643   0    720160   4  61    1120 DelayedCall
     5     642   0    713904   4  66    1112 Client
     6     617   0    691040   4  70    1120 Connector
     7     639   0    577656   3  73     904 type
     8     691   0    556576   3  77    1120 Deferred
     9    3962   4    475440   2  80     120 function
    10    3857   4    462840   2  82     120 code
    11    3017   3    308192   1  84    4856 list
    12     240   0    266880   1  86    1112 Method
    13    2968   3    237440   1  87      80 instancemethod
    14     612   0    215424   1  88     352 InsensitiveDict
    15     217   0    211128   1  90   12624 module
    16    2185   2    157320   0  91      72 builtin_function_or_method
    17     107   0    119840   0  91    1120 HTTPPageGetter
    18     343   0    117992   0  92     344 IcecastRadioStation
    19     343   0    117992   0  93     344 HTTPExtractor

And top around that time:

VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
248m  27m 4152 R   92  1.6   0:09.21 python

Now we wait a while and check again; this is the picture after 20 minutes of running (around 40 query cycles):

Total 67428 objects, 188 types, Total size = 11.9MiB (12463799 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0    3865   5   3601624  28  28  786712 dict
     1   23762  35   2002029  16  44    4874 str
     2   16382  24   1346208  10  55     608 tuple
     3     644   0    582176   4  60     904 type
     4     174   0    554304   4  64    3424 HTTPClientFactory
     5     456   0    510720   4  68    1120 DelayedCall
     6    3963   5    475560   3  72     120 function
     7    3857   5    462840   3  76     120 code
     8     240   0    266880   2  78    1112 Method
     9     237   0    263544   2  80    1112 Client
    10     217   0    211128   1  82   12624 module
    11     187   0    209440   1  84    1120 Connector
    12     182   0    194624   1  85    1120 Deferred
    13    1648   2    179696   1  87    3768 list
    14    1530   2    122400   0  88      80 instancemethod
    15     343   0    117992   0  89     344 IcecastRadioStation
    16     343   0    117992   0  90     344 HTTPExtractor
    17    1175   1    103400   0  90      88 weakref
    18    1109   1     88720   0  91      80 wrapper_descriptor
    19      75   0     83400   0  92    1112 InterfaceClass

And top:

VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
532m 240m 4152 S   54 13.7   4:02.64 python

According to meliae, neither the object count nor the total size is growing. Yet the process ate up 200 MB of resident memory during those 20 minutes.
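(A simple cross-check for this kind of gap between the Python heap and the process footprint is to log the resident set size next to each meliae summary; the snippet below is a minimal sketch that assumes Linux and reads VmRSS from /proc.)

def rss_kib():
    # Read the resident set size of the current process (Linux only).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # The line looks like "VmRSS:     245760 kB"
                return int(line.split()[1])
    return None

print "RSS: %s kB" % rss_kib()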

I also ran Python under valgrind, but no leaks were found. Any thoughts?

I am using Python 2.6.6 and Twisted 10.2.0.


Update #1: I also used valgrind's massif tool to profile CPython memory usage; here is the allocation tree for 99.93% of the memory allocated:

99.93% (578,647,287B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->94.69% (548,309,283B) 0x550819: O_cwrite (cStringIO.c:406)
| ->94.69% (548,309,283B) 0x55096F: O_write (cStringIO.c:436)
| ->94.69% (548,309,283B) 0x5A17F9: PyCFunction_Call (methodobject.c:81)
| ->94.69% (548,309,283B) 0x4D1373: call_function (ceval.c:3750)
| ->94.69% (548,309,283B) 0x4CC2A2: PyEval_EvalFrameEx (ceval.c:2412)
| ->94.69% (548,309,283B) 0x4D1868: fast_function (ceval.c:3836)
| ->94.69% (548,309,283B) 0x4D1549: call_function (ceval.c:3771)
| ->94.69% (548,309,283B) 0x4CC2A2: PyEval_EvalFrameEx (ceval.c:2412)
| ->94.69% (548,309,283B) 0x4D1868: fast_function (ceval.c:3836)
| ->94.69% (548,309,283B) 0x4D1549: call_function (ceval.c:3771)
| ->94.69% (548,309,283B) 0x4CC2A2: PyEval_EvalFrameEx (ceval.c:2412)
| ->94.69% (548,309,283B) 0x4D1868: fast_function (ceval.c:3836)
| ->94.69% (548,309,283B) 0x4D1549: call_function (ceval.c:3771)
| ->94.69% (548,309,283B) 0x4CC2A2: PyEval_EvalFrameEx (ceval.c:2412)
| ->94.69% (548,309,283B) 0x4D1868: fast_function (ceval.c:3836)
| ->94.69% (548,309,283B) 0x4D1549: call_function (ceval.c:3771)
| ->94.69% (548,309,283B) 0x4CC2A2: PyEval_EvalFrameEx (ceval.c:2412)
| ->94.69% (548,309,283B) 0x4CEBB3: PyEval_EvalCodeEx (ceval.c:3000)
| ->94.69% (548,309,283B) 0x5A0DC6: function_call (funcobject.c:524)
| ->94.69% (548,309,283B) 0x4261E8: PyObject_Call (abstract.c:2492)
| ->94.69% (548,309,283B) 0x4D2870: ext_do_call (ceval.c:4063)
| ->94.69% (548,309,283B) 0x4CC4E3: PyEval_EvalFrameEx (ceval.c:2452)
| ->94.69% (548,309,283B) 0x4CEBB3: PyEval_EvalCodeEx (ceval.c:3000)
| ->94.69% (548,309,283B) 0x5A0DC6: function_call (funcobject.c:524)
| ->94.69% (548,309,283B) 0x4261E8: PyObject_Call (abstract.c:2492)
| ->94.69% (548,309,283B) 0x4D2870: ext_do_call (ceval.c:4063)
| ->94.69% (548,309,283B) 0x4CC4E3: PyEval_EvalFrameEx (ceval.c:2452)
| ->94.69% (548,309,283B) 0x4CEBB3: PyEval_EvalCodeEx (ceval.c:3000)
| ->94.69% (548,309,283B) 0x5A0DC6: function_call (funcobject.c:524)
| ->94.69% (548,309,283B) 0x4261E8: PyObject_Call (abstract.c:2492)

My guess is that you're scheduling these page fetches based on a fixed timer, and you're not paying attention to when the fetches actually end. Pretend that each page takes 60 seconds to fetch. You have a huge pile of fetches scheduled for 30 seconds from now, and then more again 30 seconds later, with more and more piling up while you're still completing the earlier requests. This is just a guess, though, as even this simplified example isn't completely self-contained. (Can you reproduce it without a database involved, with just a fixed list of URLs?)
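(Something along these lines would be such a self-contained reproduction, with a hard-coded URL list in place of the MySQL query; the URLs and the 30-second interval are placeholders.)

from twisted.internet import reactor
from twisted.web.client import getPage

# Hypothetical fixed URL list standing in for the database query.
URLS = [
    "http://example.com/status.xsl",
    "http://example.org/7.html",
]

def done(result, url):
    # Reschedule regardless of success or failure, as in the original script.
    reactor.callLater(30, load, url)

def load(url):
    d = getPage(url, headers={"Accept": "text/html"}, timeout=30)
    d.addBoth(done, url)

for url in URLS:
    load(url)

reactor.run()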

That stack trace is not particularly helpful, either; effectively, it just says that the memory was allocated by calling a Python function, which should be obvious. You might want to try a Python-specific memory profiler like Heapy or Dowser to see where your Python objects are going.
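(A minimal Heapy sketch, assuming the guppy package is installed; it prints a breakdown of live Python objects by type and can be scheduled periodically from the reactor.)

from guppy import hpy
from twisted.internet import reactor

_heapy = hpy()

def heap_snapshot():
    # Print a breakdown of all live Python objects, grouped by type.
    print _heapy.heap()
    reactor.callLater(60, heap_snapshot)

# Start it alongside the fetch loop, e.g.:
# reactor.callLater(60, heap_snapshot)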
