
Read timeout using either urllib2 or any other http library

I have code for reading a URL like this:

from urllib2 import Request, urlopen
req = Request(url)
for key, val in headers.items():
    req.add_header(key, val)
res = urlopen(req, timeout = timeout)
# This line blocks
content = res.read()

The timeout works for the urlopen() call. But then the code gets to the res.read() call where I want to read the response data, and the timeout isn't applied there. So the read call may hang almost forever waiting for data from the server. The only solution I've found is to use a signal to interrupt the read(), which is not suitable for me since I'm using threads.

What other options are there? Is there an HTTP library for Python that handles read timeouts? I've looked at httplib2 and requests and they seem to suffer from the same issue as above. I don't want to write my own non-blocking network code using the socket module because I think there should already be a library for this.
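For illustration, what I'm after is a total deadline on reading the whole body. A rough sketch of layering that on top of requests' streaming API follows (the helper name read_with_deadline is only illustrative, and each individual chunk read can still block for up to the per-socket timeout):

import time
import requests  # $ pip install requests

def read_with_deadline(url, deadline=10, chunk_size=8192):
    # requests' timeout= applies per socket operation, so the overall
    # wall-clock deadline is enforced manually while streaming the body
    start = time.time()
    response = requests.get(url, stream=True, timeout=deadline)
    chunks = []
    for chunk in response.iter_content(chunk_size):
        if time.time() - start > deadline:
            response.close()
            raise IOError('total read deadline exceeded')
        chunks.append(chunk)
    return b''.join(chunks)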

Update: None of the solutions below are doing it for me. You can see for yourself that setting the socket or urlopen timeout has no effect when downloading a large file:

from urllib2 import urlopen
url = 'http://iso.linuxquestions.org/download/388/7163/http/se.releases.ubuntu.com/ubuntu-12.04.3-desktop-i386.iso'
c = urlopen(url)
c.read()

At least on Windows with Python 2.7.3, the timeouts are being completely ignored.

I found in my tests (using the technique described here) that a timeout set in the urlopen() call also affects the read() call:

import urllib2 as u
c = u.urlopen('http://localhost/', timeout=5.0)
s = c.read(1<<20)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 1298, in read
    return s + self._file.read(amt - len(s))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
socket.timeout: timed out

Maybe it's a feature of newer versions? I'm using Python 2.7 on Ubuntu 12.04 straight out of the box.

It's not possible for any library to do this without using some kind of asynchronous timer, through threads or otherwise. The reason is that the timeout parameter used in httplib, urllib2 and other libraries sets the timeout on the underlying socket. And what this actually does is explained in the documentation.

SO_RCVTIMEO

Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. If a receive operation has blocked for this much time without receiving additional data, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.

The phrase "without receiving additional data" is key. A socket.timeout is only raised if not a single byte has been received for the duration of the timeout window. In other words, this is a timeout between received bytes.

A simple function using threading.Timer could be as follows.

import httplib
import socket
import threading

def download(host, path, timeout = 10):
    content = None

    http = httplib.HTTPConnection(host)
    http.request('GET', path)
    response = http.getresponse()

    # force a blocked read() to return early by shutting down the socket's
    # read side once the timeout expires
    timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
    timer.start()

    try:
        content = response.read()
    except httplib.IncompleteRead:
        pass

    timer.cancel() # calling cancel() on a Timer that has already fired is safe
    http.close()

    return content

>>> host = 'releases.ubuntu.com'
>>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
>>> print content is None
True
>>> content = download(host, '/15.04/MD5SUMS', 1)
>>> print content is None
False

Other than checking for None, it's also possible to catch the httplib.IncompleteRead exception not inside the function, but outside of it. The latter case will not work, though, if the HTTP response doesn't have a Content-Length header.
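For example, with a variant of download() that does not swallow the exception (an assumption; the version above catches it internally), the caller could keep whatever arrived before the read was cut off:

# sketch: assumes download() was changed to let httplib.IncompleteRead
# propagate instead of catching it internally
try:
    content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
except httplib.IncompleteRead as e:
    content = e.partial  # the bytes received before the timeout hit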

One possible (imperfect) solution is to set the global socket timeout, explained in more detail here:

import socket
import urllib2

# timeout in seconds
socket.setdefaulttimeout(10)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)

However, this only works if you're willing to globally modify the timeout for all users of the socket module. I'm running the request from within a Celery task, so doing this would mess up timeouts for the Celery worker code itself.
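If touching the global default is acceptable at all, a slightly less invasive (but still not thread-safe) sketch is to save and restore it around the call; the only APIs assumed here are socket.getdefaulttimeout() and socket.setdefaulttimeout():

import socket
import urllib2
from contextlib import contextmanager

@contextmanager
def default_socket_timeout(seconds):
    # temporarily override the process-wide default socket timeout;
    # other threads creating sockets inside the block will see it too
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        socket.setdefaulttimeout(previous)

with default_socket_timeout(10):
    response = urllib2.urlopen('http://www.voidspace.org.uk')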

I'd be happy to hear any other solutions...

I'd expect this to be a common problem, and yet no answers are to be found anywhere... I just built a solution for this using a timeout signal:

import urllib2
import socket

timeout = 10
socket.setdefaulttimeout(timeout)

import time
import signal

def timeout_catcher(signum, _):
    raise urllib2.URLError("Read timeout")

signal.signal(signal.SIGALRM, timeout_catcher)

def safe_read(url, timeout_time):
    signal.setitimer(signal.ITIMER_REAL, timeout_time)
    content = urllib2.urlopen(url, timeout=timeout_time).read()
    signal.setitimer(signal.ITIMER_REAL, 0)
    # you should also catch any exceptions coming out of urlopen here,
    # set the timer to 0, and pass the exceptions on
    return content
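A minimal sketch of what that last comment suggests, clearing the timer in a finally block so a failing urlopen() or read() can't leave the alarm armed (same assumptions as above: SIGALRM, so main thread and Unix only):

def safe_read(url, timeout_time):
    signal.setitimer(signal.ITIMER_REAL, timeout_time)
    try:
        return urllib2.urlopen(url, timeout=timeout_time).read()
    finally:
        # always disarm the timer, even when urlopen() or read() raises
        signal.setitimer(signal.ITIMER_REAL, 0)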

By the way, the credit for the signal part of the solution goes here: python timer mystery

The pycurl.TIMEOUT option works for the whole request:

#!/usr/bin/env python3
"""Test that pycurl.TIMEOUT does limit the total request timeout."""
import sys
import pycurl

timeout = 2 #NOTE: it does limit both the total *connection* and *read* timeouts
c = pycurl.Curl()
c.setopt(pycurl.CONNECTTIMEOUT, timeout)
c.setopt(pycurl.TIMEOUT, timeout)
c.setopt(pycurl.WRITEFUNCTION, sys.stdout.buffer.write)
c.setopt(pycurl.HEADERFUNCTION, sys.stderr.buffer.write)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.URL, 'http://localhost:8000')
c.setopt(pycurl.HTTPGET, 1)
c.perform()

The code raises the timeout error in ~2 seconds. I've tested the total read timeout against a server that sends the response in multiple chunks, where the delay between chunks is less than the timeout itself:

$ python -mslow_http_server 1

where slow_http_server.py is:

#!/usr/bin/env python
"""Usage: python -mslow_http_server [<read_timeout>]

   Return an http response with *read_timeout* seconds between parts.
"""
import time
try:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer, test
except ImportError: # Python 3
    from http.server import BaseHTTPRequestHandler, HTTPServer, test

def SlowRequestHandlerFactory(read_timeout):
    class HTTPRequestHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            n = 5
            data = b'1\n'
            self.send_response(200)
            self.send_header("Content-type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", n*len(data))
            self.end_headers()
            for i in range(n):
                self.wfile.write(data)
                self.wfile.flush()
                time.sleep(read_timeout)
    return HTTPRequestHandler

if __name__ == "__main__":
    import sys
    read_timeout = int(sys.argv[1]) if len(sys.argv) > 1 else 5
    test(HandlerClass=SlowRequestHandlerFactory(read_timeout),
         ServerClass=HTTPServer)

I've tested the total connection timeout with http://google.com:22222.

Any asynchronous network library should allow you to enforce a total timeout on any I/O operation; e.g., here's a gevent code example:

#!/usr/bin/env python2
import gevent
import gevent.monkey # $ pip install gevent
gevent.monkey.patch_all()

import urllib2

with gevent.Timeout(2): # enforce total timeout
    response = urllib2.urlopen('http://localhost:8000')
    encoding = response.headers.getparam('charset')
    print response.read().decode(encoding)

And here's the asyncio equivalent:

#!/usr/bin/env python3.5
import asyncio
import aiohttp # $ pip install aiohttp

async def fetch_text(url):
    response = await aiohttp.get(url)
    return await response.text()

text = asyncio.get_event_loop().run_until_complete(
    asyncio.wait_for(fetch_text('http://localhost:8000'), timeout=2))
print(text)
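Note that the module-level aiohttp.get() shortcut used above was removed in later aiohttp releases; a sketch of the same request with an explicit ClientSession (assuming aiohttp 3.x) would be:

import asyncio
import aiohttp  # 3.x: the module-level get() shortcut no longer exists

async def fetch_text(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

text = asyncio.get_event_loop().run_until_complete(
    asyncio.wait_for(fetch_text('http://localhost:8000'), timeout=2))
print(text)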

The test http server is defined here.

This isn't the behavior I see. I get a URLError when the call times out:

from urllib2 import Request, urlopen
req = Request('http://www.google.com')
res = urlopen(req,timeout=0.000001)
#  Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  ...
#  raise URLError(err)
#  urllib2.URLError: <urlopen error timed out>

Can't you catch this error and then avoid trying to read res? When I try to use res.read() after this, I get NameError: name 'res' is not defined. Is something like this what you need:

try:
    res = urlopen(req, timeout=3.0)
except:
    print 'Doh!'
else:
    print 'yay!'
    print res.read()

I suppose the way to implement a timeout manually is via multiprocessing, no? If the job hasn't finished, you can terminate it.
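A rough sketch of that idea using multiprocessing.Pool (the helper names are only illustrative, and on Windows the usual if __name__ == '__main__' guard is needed):

import multiprocessing
import urllib2

def fetch(url):
    # plain blocking download; runs in a separate worker process
    return urllib2.urlopen(url).read()

def download_with_deadline(url, timeout=10):
    pool = multiprocessing.Pool(processes=1)
    result = pool.apply_async(fetch, (url,))
    try:
        return result.get(timeout=timeout)
    except multiprocessing.TimeoutError:
        return None
    finally:
        pool.terminate()  # also kills a worker stuck inside read()
        pool.join()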

Had the same issue with socket timeout on the read statement. What worked for me was putting both the urlopen and the read inside a try statement, roughly as sketched below. Hope this helps!
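A minimal sketch of that (the URL is just a placeholder; the except clause only fires if the server goes completely silent for the whole timeout, per the SO_RCVTIMEO semantics discussed above):

import socket
import urllib2

try:
    res = urllib2.urlopen('http://example.com/', timeout=10)
    content = res.read()
except (socket.timeout, urllib2.URLError):
    # a timeout during connect shows up as URLError,
    # one during read() as socket.timeout
    content = None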
