[英]Twisted getPage(): process memory grow when requesting lot of pages
I am writing a script for contstant (each 30-120 sec) grabbing of information quering a large set of URLs (Icecast/Shoutcast servers status pages), about 500 urls. 我正在为contstant(每个30-120秒)编写一个脚本来获取大量URL(Icecast / Shoutcast服务器状态页面)的信息,大约500个网址。 It works fine, but the python process resident size is constantly keep growing.
它工作正常,但python进程驻留大小不断增长。 I am sure it is endless grow as I left it running for several hours and it took 1.2Gb RES from initial 30Mb.
我确信它会无限增长,因为我让它运行了几个小时,从最初的30Mb开始需要1.2Gb RES。
I simplified the script for easy understanding to the following: 我简化了脚本以便于理解以下内容:
from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.enterprise import adbapi
def ok(res, url):
print "OK: " + str(url)
reactor.callLater(30, load, url)
def error(res, url):
print "FAIL: " + str(url)
reactor.callLater(30, load, url)
def db_ok(res):
for item in res:
if item[1]:
print "ADDED: " + str(item[1])
reactor.callLater(30, load, item[1])
def db_error(res):
print "Database error: " + str(res)
reactor.stop()
def load(url):
d = getPage(url,
headers={"Accept": "text/html"},
timeout=30)
d.addCallback(ok, url)
d.addErrback(error, url)
dbpool = adbapi.ConnectionPool("MySQLdb", "host", "user", "passwd", db="db")
q = dbpool.runQuery("SELECT id, url FROM stations")
q.addCallback(db_ok).addErrback(db_error)
reactor.run()
It grows the same as the original daemon and thus I localized the problem. 它与原始守护进程一样增长,因此我对问题进行了本地化。 I think it is related to twisted.web.client.getPage() somehow.
我认为它与twisted.web.client.getPage()不知何故有关。 In original daemon I have used twisted.manhole for heap evalutaions with meliae during run time, but do not see anything nasty.
在原始的守护进程中,我在运行时使用了twisted.manhole用于堆积evalutaions与meliae,但没有看到任何令人讨厌的东西。
First meliae dump made in right after only 1 or 2 query cycles are finished: 在完成1或2个查询周期后立即进行的第一个meliae转储:
Total 84313 objects, 188 types, Total size = 15.9MiB (16647235 bytes)
Index Count % Size % Cum Max Kind
0 5806 6 4142800 24 24 786712 dict
1 28070 33 2223457 13 38 4874 str
2 612 0 1636992 9 48 3424 HTTPClientFactory
3 19599 23 1585720 9 57 608 tuple
4 643 0 720160 4 61 1120 DelayedCall
5 642 0 713904 4 66 1112 Client
6 617 0 691040 4 70 1120 Connector
7 639 0 577656 3 73 904 type
8 691 0 556576 3 77 1120 Deferred
9 3962 4 475440 2 80 120 function
10 3857 4 462840 2 82 120 code
11 3017 3 308192 1 84 4856 list
12 240 0 266880 1 86 1112 Method
13 2968 3 237440 1 87 80 instancemethod
14 612 0 215424 1 88 352 InsensitiveDict
15 217 0 211128 1 90 12624 module
16 2185 2 157320 0 91 72 builtin_function_or_method
17 107 0 119840 0 91 1120 HTTPPageGetter
18 343 0 117992 0 92 344 IcecastRadioStation
19 343 0 117992 0 93 344 HTTPExtractor
And top around that time: 最重要的是:
VIRT RES SHR S %CPU %MEM TIME+ COMMAND
248m 27m 4152 R 92 1.6 0:09.21 python
Now we wait for a time and check again, this is the picture after 20 minutes running (around 40 query cycles): 现在我们等待一段时间并再次检查,这是运行20分钟后的图片(大约40个查询周期):
Total 67428 objects, 188 types, Total size = 11.9MiB (12463799 bytes)
Index Count % Size % Cum Max Kind
0 3865 5 3601624 28 28 786712 dict
1 23762 35 2002029 16 44 4874 str
2 16382 24 1346208 10 55 608 tuple
3 644 0 582176 4 60 904 type
4 174 0 554304 4 64 3424 HTTPClientFactory
5 456 0 510720 4 68 1120 DelayedCall
6 3963 5 475560 3 72 120 function
7 3857 5 462840 3 76 120 code
8 240 0 266880 2 78 1112 Method
9 237 0 263544 2 80 1112 Client
10 217 0 211128 1 82 12624 module
11 187 0 209440 1 84 1120 Connector
12 182 0 194624 1 85 1120 Deferred
13 1648 2 179696 1 87 3768 list
14 1530 2 122400 0 88 80 instancemethod
15 343 0 117992 0 89 344 IcecastRadioStation
16 343 0 117992 0 90 344 HTTPExtractor
17 1175 1 103400 0 90 88 weakref
18 1109 1 88720 0 91 80 wrapper_descriptor
19 75 0 83400 0 92 1112 InterfaceClass
And top: 顶部:
VIRT RES SHR S %CPU %MEM TIME+ COMMAND
532m 240m 4152 S 54 13.7 4:02.64 python
According to meliae neither objects count is growing nor total size. 根据meliae,对象数量不会增长也不会增加总尺寸。 But the process eat up 200Mb of resident memory during this 20 minutes.
但是这个过程在这20分钟内耗尽了200Mb的驻地记忆。
I also used valgrind on python, but no leaks are found. 我也在python上使用了valgrind,但没有发现泄漏。 Any thoughts?
有什么想法吗?
I am using Python version 2.6.6, twisted version 10.2.0 我正在使用Python版本2.6.6,扭曲版本10.2.0
99.93% (578,647,287B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc. ->94.69% (548,309,283B) 0x550819: O_cwrite (cStringIO.c:406) | ->94.69% (548,309,283B) 0x55096F: O_write (cStringIO.c:436) | ->94.69% (548,309,283B) 0x5A17F9: PyCFunction_Call (methodobject.c:81) | ->94.69% (548,309,283B) 0x4D1373: call_function (ceval.c:3750) | ->94.69% (548,309,283B) 0x4CC2A2: PyEval_EvalFrameEx (ceval.c:2412) | ->94.69% (548,309,283B) 0x4D1868: fast_function (ceval.c:3836) | ->94.69% (548,309,283B) 0x4D1549: call_function (ceval.c:3771) | ->94.69% (548,309,283B) 0x4CC2A2: PyEval_EvalFrameEx (ceval.c:2412) | ->94.69% (548,309,283B) 0x4D1868: fast_function (ceval.c:3836) | ->94.69% (548,309,283B) 0x4D1549: call_function (ceval.c:3771) | ->94.69% (548,309,283B) 0x4CC2A2: PyEval_EvalFrameEx (ceval.c:2412) | ->94.69% (548,309,283B) 0x4D1868: fast_function (ceval.c:3836) | ->94.69% (548,309,283B) 0x4D1549: call_function (ceval.c:3771) | ->94.69% (548,309,283B) 0x4CC2A2: PyEval_EvalFrameEx (ceval.c:2412) | ->94.69% (548,309,283B) 0x4D1868: fast_function (ceval.c:3836) | ->94.69% (548,309,283B) 0x4D1549: call_function (ceval.c:3771) | ->94.69% (548,309,283B) 0x4CC2A2: PyEval_EvalFrameEx (ceval.c:2412) | ->94.69% (548,309,283B) 0x4CEBB3: PyEval_EvalCodeEx (ceval.c:3000) | ->94.69% (548,309,283B) 0x5A0DC6: function_call (funcobject.c:524) | ->94.69% (548,309,283B) 0x4261E8: PyObject_Call (abstract.c:2492) | ->94.69% (548,309,283B) 0x4D2870: ext_do_call (ceval.c:4063) | ->94.69% (548,309,283B) 0x4CC4E3: PyEval_EvalFrameEx (ceval.c:2452) | ->94.69% (548,309,283B) 0x4CEBB3: PyEval_EvalCodeEx (ceval.c:3000) | ->94.69% (548,309,283B) 0x5A0DC6: function_call (funcobject.c:524) | ->94.69% (548,309,283B) 0x4261E8: PyObject_Call (abstract.c:2492) | ->94.69% (548,309,283B) 0x4D2870: ext_do_call (ceval.c:4063) | ->94.69% (548,309,283B) 0x4CC4E3: PyEval_EvalFrameEx (ceval.c:2452) | ->94.69% (548,309,283B) 0x4CEBB3: PyEval_EvalCodeEx (ceval.c:3000) | ->94.69% (548,309,283B) 0x5A0DC6: function_call (funcobject.c:524) | ->94.69% (548,309,283B) 0x4261E8: PyObject_Call (abstract.c:2492)
My guess is that you're scheduling these page fetches based on a fixed timer, and you're not paying attention to where the fetches actually end. 我的猜测是你正在根据一个固定的计时器安排这些页面提取,而你没有注意提取实际结束的位置。 Pretend that each page takes 60 seconds to fetch.
假装每个页面需要60秒才能获取。 You have a huge pile of fetches scheduled for 30 seconds, and then again in 30 seconds, more and more piling up as you're completing earlier requests.
你有大量的提取安排30秒,然后再在30秒内,当你完成早先的请求时,越来越多的堆积。 This is just a guess, though, as even this simplified example isn't completely self-contained.
但这只是猜测,因为即使是这个简化的例子也不是完全独立的。 (Can you reproduce it without having a database involved, with just a fixed list of URLs?)
(你可以在没有涉及数据库的情况下重现它,只有一个固定的URL列表吗?)
That stack trace is not particularly helpful, either; 堆栈跟踪也不是特别有用; effectively, it just says that the memory was allocated by calling a python function, which should be obvious.
实际上,它只是说通过调用python函数来分配内存,这应该是显而易见的。 You might want to try a Python-specific memory profiler like Heapy or Dowser to see where your Python objects are going.
您可能想尝试像Heapy或Dowser这样的Python特定的内存分析器来查看Python对象的去向 。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.