
Non-blocking read/log from an HTTP stream

I have a client that connects to an HTTP stream and logs the text data it consumes.

I send the streaming server an HTTP GET request... The server replies and continuously publishes data... It will either publish text or send a ping (text) message regularly... and it will never close the connection.

I need to read and log the data it consumes in a non-blocking manner.

I am doing something like this:

import urllib2

req = urllib2.urlopen(url)    
for dat in req: 
    with open('out.txt', 'a') as f:        
        f.write(dat) 

My questions are:

1. Will this ever block when the stream is continuous?
2. How much data is read in each chunk, and can it be specified/tuned?
3. Is this the best way to read/log an HTTP stream?

Hey, that's three questions in one! ;-)

It could block sometimes - even if your server is generating data quite quickly, network bottlenecks could in theory cause your reads to block.

Reading the URL data using "for dat in req" means reading a line at a time - not really useful if you're reading binary data such as an image. You get better control if you use

chunk = req.read(size)

which can of course block.
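For example, a minimal sketch of a chunked read loop (the function name, the size value and the out.txt filename are just illustrations; any file-like object with a read(size) method would work here, including the response urlopen returns):

```python
import io

def log_chunks(stream, path, size=1024):
    # Repeatedly read up to `size` bytes and append them to the log file.
    # Works with any file-like object exposing read(size).
    with open(path, 'ab') as f:
        while True:
            chunk = stream.read(size)  # may block until data arrives
            if not chunk:              # empty result: the stream was closed
                break
            f.write(chunk)
            f.flush()                  # keep the log current for a long-running stream

# In-memory stream standing in for the HTTP response:
log_chunks(io.BytesIO(b'some streamed data'), 'out.txt', size=4)
```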

Whether it's the best way depends on specifics not available in your question. For example, if you need to run with no blocking calls whatsoever, you'll need to consider a framework like Twisted. If you don't want blocking to hold you up and don't want to use Twisted (which is a whole new paradigm compared to the blocking way of doing things), then you can spin up a thread to do the reading and writing to file, while your main thread goes on its merry way:

import threading

def func(req):
    # code the read from URL stream and write to file here
    ...

...

t = threading.Thread(target=func, args=(req,))  # pass the response to the worker
t.start() # will execute func in a separate thread
...
t.join() # will wait for spawned thread to die

Obviously, I've omitted error checking/exception handling etc. but hopefully it's enough to give you the picture.

You're using too high-level an interface to have good control over such issues as blocking and buffering block sizes. If you're not willing to go all the way to an async interface (in which case Twisted, already suggested, is hard to beat!), why not httplib, which is after all in the standard library? The HTTPResponse instance's .read(amount) method is more likely to block for no longer than needed to read amount bytes than the similar method on the object returned by urlopen (although admittedly there are no documented specs about that in either module, hmmm...).
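A rough sketch of that approach (written against Python 3's http.client, which is what httplib became; the tiny local server exists only to make the snippet self-contained and runnable - in practice you would connect to your real streaming server):

```python
import http.client  # named httplib in Python 2
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    # Stand-in server: sends a short body, then closes the connection.
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'hello stream')
    def log_message(self, *args):
        pass  # silence request logging

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.handle_request, daemon=True).start()

conn = http.client.HTTPConnection('127.0.0.1', server.server_port)
conn.request('GET', '/stream')
resp = conn.getresponse()

chunks = []
while True:
    chunk = resp.read(5)  # blocks only until up to 5 bytes are available
    if not chunk:         # empty bytes: the server closed the connection
        break
    chunks.append(chunk)
server.server_close()
print(b''.join(chunks))
```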

Another option is to use the socket module directly. Establish a connection, send the HTTP request, set the socket to non-blocking mode, and then read the data with socket.recv(), handling 'Resource temporarily unavailable' exceptions (which mean there is nothing to read). A very rough example is this:

import errno, socket, time

BUFSIZE = 1024

s = socket.socket()
s.connect(('localhost', 1234))
# HTTP lines must end with \r\n, and a blank line terminates the request
s.send('GET /path HTTP/1.0\r\n\r\n')
s.setblocking(False)

running = True

while running:
    try:
        print "Attempting to read from socket..."
        while True:
            data = s.recv(BUFSIZE)
            if len(data) == 0:      # remote end closed
                print "Remote end closed"
                running = False
                break
            print "Received %d bytes: %r" % (len(data), data)
    except socket.error, e:
        if e.errno != errno.EWOULDBLOCK:    # Resource temporarily unavailable
            print e
            raise

    # perform other program tasks
    print "Sleeping..."
    time.sleep(1)

However, urllib.urlopen() has some benefits if the web server redirects, if you need URL-based basic authentication, etc. You could also make use of the select module, which will tell you when there is data to read.
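For instance, a small readiness check built on select (shown in Python 3, with a local socket pair standing in for the HTTP connection; the function name and timeout are just illustrations):

```python
import select
import socket

def readable(sock, timeout=1.0):
    # select() returns immediately if data is waiting, and otherwise gives up
    # after `timeout` seconds, so the caller is never blocked indefinitely.
    ready, _, _ = select.select([sock], [], [], timeout)
    return bool(ready)

a, b = socket.socketpair()
print(readable(a, timeout=0.1))  # False: nothing has been sent yet
b.sendall(b'ping')
print(readable(a, timeout=0.1))  # True: data is waiting in the buffer
```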

Yes, when you catch up with the server it will block until the server produces more data.

Each dat will be one line, including the newline on the end.

Twisted is a good option.

I would swap the with and the for around in your example; do you really want to open and close the file for every line that arrives?
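That is, something along these lines (a sketch with a plain iterable of lines standing in for the urllib2 response, which also iterates line by line; the function name is just an illustration):

```python
def log_lines(stream, path):
    with open(path, 'a') as f:  # open the file once, before the loop
        for dat in stream:      # iterate over lines inside the with-block
            f.write(dat)
            f.flush()           # flush so the log stays current while streaming

# Any iterable of lines works here, including the object urlopen returns:
log_lines(['first line\n', 'ping\n'], 'out.txt')
```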
