
Asynchronous Stream Processing in Python

Let's start with a simple example. An HTTP data stream comes in the following format:

MESSAGE_LENGTH, 2 bytes
MESSAGE_BODY, MESSAGE_LENGTH bytes
REPEAT...

Currently, I use urllib2 to retrieve and process streaming data as below:

import struct

while True:
    header = response.read(2)                # 2-byte length prefix
    if len(header) < 2:
        break                                # end of stream
    (length,) = struct.unpack('!H', header)  # assumes big-endian; use '<H' if little-endian
    data = response.read(length)
    # DO DATA PROCESSING

It works, but since all messages are 50-100 bytes in size, the above method limits the buffer size on each read, which may hurt performance.
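One way to cut that per-read overhead is to read larger chunks and reframe the messages from a local buffer. A minimal sketch, assuming a big-endian length prefix and a file-like response object (not part of the original question):

import struct

def iter_messages(response, chunk_size=8192):
    # Read big chunks and slice complete messages out of a local buffer.
    buf = b''
    while True:
        chunk = response.read(chunk_size)   # one big read instead of many tiny ones
        if not chunk:
            break                           # stream closed
        buf += chunk
        while len(buf) >= 2:
            (length,) = struct.unpack('!H', buf[:2])
            if len(buf) < 2 + length:
                break                       # body not fully buffered yet
            yield buf[2:2 + length]
            buf = buf[2 + length:]

Usage would then be a plain loop, for msg in iter_messages(response), with processing inside the loop body.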

Is it possible to use separate threads for data retrieval and processing?

Yes, it can be done, and it is not that hard if your format is essentially fixed.

I used this with httplib in Python 2.2.3 and found it had abysmal performance in the way we hacked it together (basically monkey-patching a select()-based socket layer into httplib).

The trick is to get the socket and do the buffering yourself, so you do not fight over buffering with the intermediate layers (that made for horrible performance when we had the httplib buffer for chunked HTTP decoding stacked on top of the socket layer's read() buffer).

Then have a state machine that fetches new data from the socket when needed and pushes completed blocks into a Queue.Queue that feeds your processing threads.
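A minimal sketch of that layout, assuming the 2-byte length framing from the question and written for Python 3, where Queue.Queue has become queue.Queue (the names and framing details are assumptions, not the answerer's actual code):

import queue
import struct
import threading

def reader(sock, out_q):
    # State machine: buffer raw socket bytes ourselves, emit complete blocks.
    buf = b''
    while True:
        data = sock.recv(65536)                   # large reads, our own buffering
        if not data:
            break
        buf += data
        while len(buf) >= 2:
            (length,) = struct.unpack('!H', buf[:2])
            if len(buf) < 2 + length:
                break                             # need more data for this block
            out_q.put(buf[2:2 + length])          # hand a completed block off
            buf = buf[2 + length:]
    out_q.put(None)                               # sentinel: no more blocks

def processor(in_q):
    while True:
        block = in_q.get()
        if block is None:
            break
        # ... process the block ...

blocks = queue.Queue(maxsize=1024)  # bounded, so reading cannot outrun processing forever
threading.Thread(target=processor, args=(blocks,), daemon=True).start()
# reader(connected_socket, blocks)  # run with an already-connected socket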

I use it to transfer files, checksum them (zlib.adler32) in an extra thread, and write them to the filesystem in a third thread. This makes for about 40 MB/s sustained throughput on my local machine via sockets, with HTTP chunked-encoding overhead.

Yes, of course, and there are many different techniques to do so. You'll typically end up having a set of processes that only retrieve data, and you increase the number of processes in that pool until you run out of bandwidth, more or less. Those processes store the data somewhere, and then you have other processes or threads that pick the data up and process it from wherever it is stored.
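One concrete shape this can take (a sketch, not the answerer's code; the URL list and pool sizes are placeholders) is a pool of retrieval-only workers handing results to a separate processing pool:

import multiprocessing
import urllib.request

def fetch(url):
    # Retrieval-only worker: download and return the raw bytes.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def process(payload):
    # Processing worker: stand-in for the real work.
    return len(payload)

if __name__ == '__main__':
    urls = ['http://example.com/'] * 4           # placeholder URLs
    # A pool that only retrieves; grow it until bandwidth runs out.
    with multiprocessing.Pool(processes=8) as fetchers:
        payloads = fetchers.map(fetch, urls)
    # A separate pool picks the data up and processes it.
    with multiprocessing.Pool(processes=4) as processors:
        results = processors.map(process, payloads)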

So the answer to your question is "Yes". Your next question is going to be "How?", and then the people who are really good at this stuff will want to know more. :-)

If you are doing this at a massive scale it can get very tricky: you don't want the processes to step all over each other, and there are modules in Python that help you do all of this. The right way to do it depends a lot on what scale we are talking about, whether you want to run this over multiple processors or maybe even over completely separate machines, and how much data we are talking about.

I've only done it once, and not on a very massive scale. I ended up having one process that got a long list of URLs to be processed, and another process that took that list and dispatched it to a set of separate processes, simply by putting files with URLs in them into separate directories that worked as "queues". The separate processes that fetched the URLs would look in their own queue directory, fetch the URL, and stick the result into another "outqueue" directory, where I had another process that would dispatch those files into another set of queue directories for the processing processes.
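A toy version of that directory-as-queue handoff (the directory names and file layout are made up for illustration, one small file per URL):

import os
import shutil

QUEUE_DIR = 'queue'          # this worker's own queue directory
OUTQUEUE_DIR = 'outqueue'    # where finished items get handed downstream

def drain_own_queue():
    os.makedirs(OUTQUEUE_DIR, exist_ok=True)
    for name in sorted(os.listdir(QUEUE_DIR)):
        path = os.path.join(QUEUE_DIR, name)
        with open(path) as f:
            url = f.read().strip()
        # ... fetch `url` and store the payload somewhere ...
        shutil.move(path, os.path.join(OUTQUEUE_DIR, name))  # hand off downstream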

That worked fine, could be run over the network with NFS if necessary (although we never tried that), and could be scaled up to loads of processes on loads of machines if needed (although we never did that either).

There may be more clever ways.
