Python: How can I use an external queue with a ProcessPoolExecutor?

I've very recently started using Python's multithreading and multiprocessing features.

I tried to write code that uses a producer/consumer approach: read chunks from a JSON log file, put those chunks as events onto a queue, and then start a set of processes that poll events (file chunks) from that queue and process each one, printing out the results.

My intent is to start the processes first and leave them waiting for events to start arriving in the queue.

I'm currently using this code, which seems to work, put together from bits and pieces of examples I found:

import re, sys
from multiprocessing import Process, Queue

def process(file, chunk):
    f = open(file, "rb")
    f.seek(chunk[0])
    for entry in pat.findall(f.read(chunk[1])):
        print(entry)

def getchunks(file, size=1024*1024):
    f = open(file, "rb")
    while True:
        start = f.tell()
        f.seek(size, 1)
        s = f.readline() # skip forward to next line ending
        yield start, f.tell() - start
        if not s:
            break

def processingChunks(queue):
    while True:
        queueEvent = queue.get()
        if queueEvent is None:
            # Sentinel received: put it back so the other workers also stop.
            queue.put(None)
            break
        process(queueEvent[0], queueEvent[1])

if __name__ == "__main__":
    testFile = "testFile.json"
    pat = re.compile(r".*?\n")
    queue = Queue()

    for w in xrange(6):
        p = Process(target=processingChunks, args=(queue,))
        p.start()

    for chunk in getchunks(testFile):
        queue.put((testFile, chunk))
        print(queue.qsize())
    queue.put(None)

However, I wanted to learn how to use the concurrent.futures ProcessPoolExecutor to achieve the same results asynchronously, using Future result objects.

My first attempt involved using an external queue, created with the multiprocessing Manager, which I would pass to the worker processes to poll.

However, this doesn't seem to work, and I reckon this may not be how ProcessPoolExecutor was designed to work, as it seems to use an internal queue of its own.

I used this code:

import concurrent
from concurrent.futures import as_completed
import re, sys
from multiprocessing import Lock, Process, Queue, current_process, Pool, Manager

def process(file, chunk):
    entries = []
    f = open(file, "rb")
    f.seek(chunk[0])
    for entry in pat.findall(f.read(chunk[1])):
        entries.append(entry)
    return entries

def getchunks(file, size=1024*1024):
    f = open(file, "rb")
    while True:
        start = f.tell()
        f.seek(size, 1)
        s = f.readline() # skip forward to next line ending
        yield start, f.tell() - start
        if not s:
            break

def processingChunks(queue):
    while True:
        queueEvent = queue.get()
        if queueEvent is None:
            queue.put(None)
            break
        return process(queueEvent[0], queueEvent[1])

if __name__ == "__main__":
    testFile = "testFile.json"
    pat = re.compile(r".*?\n")
    procManager = Manager()
    queue = procManager.Queue()

    with concurrent.futures.ProcessPoolExecutor(max_workers = 6) as executor:
        futureResults = []
        for i in range(6):
            future_result = executor.submit(processingChunks, queue)
            futureResults.append(future_result)

        for complete in as_completed(futureResults):
            res = complete.result()
            for i in res:
                print(i)


    for chunk in getchunks(testFile):
        queue.put((testFile, chunk))
        print(queue.qsize())
    queue.put(None)

I'm unable to obtain any results with this, so obviously I'm doing something wrong, and there's something about the concept that I haven't understood.

Can you give me a hand understanding how I could implement this?

If you're using a ProcessPoolExecutor, you don't need your processingChunks function at all, or any of the stuff you're importing from multiprocessing. The pool automatically does essentially what your function was doing before. Instead, use something like this to queue up and dispatch all the work in one go:

import itertools
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor(max_workers=6) as executor:
    executor.map(process, itertools.repeat(testFile), getchunks(testFile))
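
executor.map works like the built-in map: it pairs the i-th element of each iterable, so every call receives the filename from itertools.repeat(testFile) together with one (start, length) tuple from getchunks, and it returns an iterator over the results in input order.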

I'm not sure how your original code worked with pat not being an argument to process (I'd have expected every worker process to fail with a NameError exception). If that's a real issue (and not just an artifact of your example code), you may need to modify things a bit more to pass it to the worker processes along with file and chunk (itertools.repeat may come in handy).
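
If pat really does need to reach the workers, here is a minimal sketch of that variant, with itertools.repeat supplying both the filename and the compiled pattern. Assumptions: Python 3, a bytes pattern (rb".*?\n") because the file is read in binary mode, and a hypothetical three-argument rework of the question's process function; getchunks is the same generator as in the question.

import re
import itertools
import concurrent.futures

def process(file, chunk, pat):
    # Each worker reads only its own (start, length) chunk and returns the matches.
    with open(file, "rb") as f:
        f.seek(chunk[0])
        return pat.findall(f.read(chunk[1]))

def getchunks(file, size=1024 * 1024):
    # Same chunking generator as in the question: yield (start, length) pairs
    # aligned to line endings.
    with open(file, "rb") as f:
        while True:
            start = f.tell()
            f.seek(size, 1)
            s = f.readline()  # skip forward to the next line ending
            yield start, f.tell() - start
            if not s:
                break

if __name__ == "__main__":
    testFile = "testFile.json"
    pat = re.compile(rb".*?\n")  # bytes pattern, because the chunks are bytes
    with concurrent.futures.ProcessPoolExecutor(max_workers=6) as executor:
        # Compiled patterns pickle cleanly, so repeat() can ship pat to each call.
        for entries in executor.map(process, itertools.repeat(testFile),
                                    getchunks(testFile), itertools.repeat(pat)):
            for entry in entries:
                print(entry)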

Thanks to Blckknght, whose reply pushed me in the right direction. Here's a possible solution to my initial question:

#!/usr/bin/python
import concurrent
from concurrent.futures import as_completed
import re, sys

def process(event):
    entries = []
    fl = event[0]
    chunk = event[1]
    pat = event[2]
    f = open(fl, "rb")
    f.seek(chunk[0])
    for entry in pat.findall(f.read(chunk[1])):
        entries.append(entry)
    return entries

def getchunks(file, pat, size=1024*1024):
    f = open(file, "rb")
    while True:
        start = f.tell()
        f.seek(size, 1)
        s = f.readline() # skip forward to next line ending
        yield (file, (start, f.tell() - start), pat)
        if not s:
            break

if __name__ == "__main__":
    testFile = "testFile.json"
    pat = re.compile(r".*?\n")
    results = []

    with concurrent.futures.ProcessPoolExecutor() as executor:
        for res in (executor.submit(process, event) for event in getchunks(testFile, pat)):
            results.append(res)

    for complete in as_completed(results):
        for entry in complete.result():
            print('Event result: %s' % entry)
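
One possible refinement, assuming the same process and getchunks as above: move the as_completed loop inside the with block so each chunk's entries print as soon as that future finishes. As written, the with block waits on exit until every submitted future is done, so all the results are already complete before the loop starts.

    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = [executor.submit(process, event) for event in getchunks(testFile, pat)]
        for complete in as_completed(results):
            for entry in complete.result():
                print('Event result: %s' % entry)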
