Dryscrape/webkit_server memory leak

I'm using dryscrape/webkit_server to scrape JavaScript-enabled websites.

The memory usage of the webkit_server process seems to increase with each call to session.visit(). I can reproduce it with the following script:

import dryscrape

for url in urls: 
    session = dryscrape.Session()
    session.set_timeout(10)
    session.set_attribute('auto_load_images', False)
    session.visit(url)
    response = session.body()

I'm iterating over approx. 300 URLs, and after 70-80 of them webkit_server takes up about 3GB of memory. However, the memory itself is not really the problem for me; the real issue is that dryscrape/webkit_server gets slower with each iteration. After the said 70-80 iterations dryscrape is so slow that it raises a timeout error (timeout set to 10 seconds) and I need to abort the Python script. Restarting webkit_server (e.g. after every 30 iterations) might help and would free the memory, but I'm unsure whether the "memory leaks" are really what makes dryscrape slower and slower.

Does anyone know how to restart the webkit_server so I could test that?
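
Something along these lines is what I have in mind, although it is only a rough sketch: I don't even know whether replacing the Session object actually terminates the old webkit_server process, which is exactly what I'd like to test (urls is the same list as in the script above):

import dryscrape

BATCH = 30  # arbitrary: recreate the session every 30 URLs

session = None
for i, url in enumerate(urls):
    if i % BATCH == 0:
        # hopefully this also gets rid of the old webkit_server process;
        # if it does not, the processes will just pile up
        session = dryscrape.Session()
        session.set_timeout(10)
        session.set_attribute('auto_load_images', False)
    session.visit(url)
    response = session.body()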

I have not found an acceptable workaround for this issue, but I also don't want to switch to another solution (selenium/phantomjs, ghost.py), as I simply love dryscrape for its simplicity. Dryscrape works great, by the way, as long as one does not iterate over too many URLs in one session.

This issue is also discussed here:

https://github.com/niklasb/dryscrape/issues/41

and here:

Webkit_server (called from python's dryscrape) uses more and more memory with each page visited. How do I reduce the memory used?

The memory leak you're seeing may also be related to the fact that the webkit_server process is never actually killed (and that you're spawning a new dryscrape.Session on every iteration, which spawns a webkit_server process in the background that never gets killed). So it will just keep spawning a new process on every restart. @Kenneth's answer may work, but any solution that requires calling the command line is sketchy. A better solution is to declare the session once at the beginning and kill the webkit_server process from Python at the end:

import webkit_server
import dryscrape

server = webkit_server.Server()
server_conn = webkit_server.ServerConnection(server=server)
driver = dryscrape.driver.webkit.Driver(connection=server_conn)
sess = dryscrape.Session(driver=driver)
# set session settings as needed here

for url in urls:
    sess.visit(url)
    response = sess.body()
    sess.reset()

server.kill() # the crucial line!

Frankly, this is a shortcoming of the dryscrape library: the kill command should be accessible from the dryscrape Session.
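
Until that happens, one way to make sure the server is always cleaned up, even when a visit fails, is to wrap the loop in a try/finally. This is only a sketch built from the same objects as above, not an official dryscrape API:

import webkit_server
import dryscrape

server = webkit_server.Server()
try:
    server_conn = webkit_server.ServerConnection(server=server)
    driver = dryscrape.driver.webkit.Driver(connection=server_conn)
    sess = dryscrape.Session(driver=driver)
    for url in urls:
        sess.visit(url)
        response = sess.body()
        sess.reset()
finally:
    server.kill()  # runs even if visit() or body() raises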

Hi,

Sorry for digging up this old post, but what I did to solve the issue (after googling and only finding this post) was to run dryscrape in a separate process and then kill Xvfb after each run.

So my dryscrape script is:

import sys
import dryscrape

dryscrape.start_xvfb()
session = dryscrape.Session()
session.set_attribute('auto_load_images', False)
session.visit(sys.argv[1])
print session.body().encode('utf-8')

And to run it:

import os
import subprocess

p = subprocess.Popen(["python", "dryscrape.py", url],
                     stdout=subprocess.PIPE)
result = p.stdout.read()
print "Killing all Xvfb"
os.system("sudo killall Xvfb")

I know it's not the best way, and the memory leak should really be fixed, but this works.
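
If you are worried about the child process itself hanging and blocking the parent script forever, a variant with a timeout might look like this. This is just a sketch and assumes Python 3, where subprocess.run() supports a timeout; the killall call is as heavy-handed as in the snippet above and will kill every Xvfb on the machine:

import os
import subprocess

def fetch(url, timeout=60):
    # run the dryscrape script in a child process and clean up Xvfb afterwards
    try:
        proc = subprocess.run(["python", "dryscrape.py", url],
                              stdout=subprocess.PIPE, timeout=timeout)
        return proc.stdout
    except subprocess.TimeoutExpired:
        return None
    finally:
        os.system("sudo killall Xvfb")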

I have had the same problem with the memory leak. I solved it by resetting the session after every page view!

A simplified workflow looks like this.

Setting up the server:

dryscrape.start_xvfb()
sess = dryscrape.Session()

Then iterate through the URLs and reset the session after every URL:

for url in urls:
    sess.set_header('user-agent', 'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36')
    sess.set_attribute('auto_load_images', False)
    sess.set_timeout(30)
    sess.visit(url)
    response = sess.body()
    sess.reset()
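
Since the original question mentions that a single slow page can raise a timeout and abort the whole run, it may also help to guard each visit so that one bad URL is skipped instead of killing the script. This is only a rough sketch; the exact exception raised on a timeout depends on the dryscrape/webkit_server version, so it catches broadly:

for url in urls:
    sess.set_attribute('auto_load_images', False)
    sess.set_timeout(30)
    try:
        sess.visit(url)
        response = sess.body()
    except Exception as exc:  # the exact timeout exception varies by version
        print('skipping %s: %s' % (url, exc))
    finally:
        sess.reset()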

Update

I still encountered the memory leak problem; the better answer is the one provided by @nico.

I have ended up abandoning dryscrape altogether and have been using Selenium and PhantomJS instead. There are still memory leaks, but they are manageable.

Make 2 scripts like this:

call.py

import os

# read urls.txt and make a list of URLs
urls = open('urls.txt').read().split('\n')
for url in urls:
    print(url)
    # recive_details.py must be executable (chmod +x recive_details.py)
    os.system("./recive_details.py %s" % url)

recive_details.py

#!/usr/bin/env python
import sys
import dryscrape as d

url = sys.argv[1]
d.start_xvfb()
br = d.Session()
br.visit(url)
# do something here, e.g. print the page title
print br.xpath("//title")[0].text()

Always run call.py like this: "python call.py". It will automatically execute the second script and kill the session immediately. I tried many other methods, but this one worked for me like magic; give it a try.

Omitting session.set_attribute('auto_load_images', False) resolved the issue for me, as described here. It seems there is a memory leak when images are not loaded.
