
python3 urllib.request will block forever in gevent

I want to write a spider program that downloads web pages using gevent in Python 3. Here is my code:

import gevent
import gevent.pool
import gevent.monkey
import urllib.request

gevent.monkey.patch_all()

def download(url):
    return urllib.request.urlopen(url).read(10)

urls = ['http://www.google.com'] * 100
jobs = [gevent.spawn(download, url) for url in urls]
gevent.joinall(jobs)

But when I run it, I get an error:

Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/gevent/greenlet.py", line 340, in run
result = self._run(*self.args, **self.kwargs)
File "e.py", line 8, in download
return urllib.request.urlopen(url).read(10)
File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
return opener.open(url, data, timeout)

......
return greenlet.switch(self)
gevent.hub.LoopExit: This operation would block forever
<Greenlet at 0x7f4b33d2fdf0: download('http://www.google.com')> failed with LoopExit
......

It seems that urllib.request blocks, so the program cannot work. How can I solve this?

It could be due to proxy settings when you are inside a company network. My personal recommendation is to use Selenium in combination with BeautifulSoup: Selenium uses a real browser to open the URL, and you can then download the HTML content or control the browser directly. Hope it helps.

from selenium import webdriver
from bs4 import BeautifulSoup

# Launch Internet Explorer; use webdriver.Firefox() or webdriver.Chrome()
# instead if another browser driver is installed
browser = webdriver.Ie()
url = "http://www.google.com"
browser.get(url)

# Grab the rendered page source and parse it
html_source = browser.page_source
soup = BeautifulSoup(html_source, "lxml")
print(soup)

browser.close()

This is the same problem as in Python, gevent, urllib2.urlopen.read(), download accelerator.

To reiterate from that post:

The argument to read is a number of bytes, not an offset.
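A minimal illustration of the read(n) semantics, using an in-memory buffer instead of a network response (the buffer contents here are just an example):

```python
import io

buf = io.BytesIO(b"abcdefghij")

# read(n) returns the NEXT n bytes from the current position,
# not the bytes starting at offset n
first = buf.read(4)   # b"abcd"
second = buf.read(4)  # b"efgh" -- continues where the last read stopped

assert first == b"abcd"
assert second == b"efgh"
```

Each call advances the internal position, so repeated read(10) calls on the same response object return successive chunks, not the same first 10 bytes.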

Also:

You're trying to read a response to a single request from different greenlets.

If you'd like to download the same file using several concurrent connections, you can use the Range HTTP header if the server supports it (you get a 206 status instead of 200 for a request with a Range header). See HTTPRangeHandler.
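A sketch of the Range-header approach: split the file into byte ranges and issue one ranged request per chunk. The helper names and the chunk-splitting scheme below are illustrative, not from the original post; a server that supports ranges answers each request with status 206 and only the requested bytes.

```python
import urllib.request

def byte_ranges(total_size, parts):
    """Split total_size bytes into `parts` contiguous (start, end) ranges,
    inclusive on both ends, as required by the Range header syntax."""
    chunk = total_size // parts
    ranges = []
    for i in range(parts):
        start = i * chunk
        # The last range absorbs any remainder bytes
        end = total_size - 1 if i == parts - 1 else start + chunk - 1
        ranges.append((start, end))
    return ranges

def ranged_request(url, start, end):
    """Build a request for bytes start..end of the resource.
    A server that supports ranges replies with 206 Partial Content."""
    req = urllib.request.Request(url)
    req.add_header("Range", "bytes=%d-%d" % (start, end))
    return req
```

Each greenlet would then open its own ranged_request (its own connection and response object) rather than sharing one response, which avoids the problem quoted above.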

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to republish, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.

粤ICP备18138465号  © 2020-2024 STACKOOM.COM