Implementing an ordered producer/consumer queue with multiprocessing

Question

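I have a very common producer/consumer scenario, with a twist.

I need to read lines of text from a multi-GB input stream (which could be a file or an HTTP stream); process each line with a slow, CPU-intensive algorithm that outputs one line of text for each input line; and then write the output lines to another stream. The twist is that I need to write the output lines in the same order as the input lines that produced them.

The usual approach for these scenarios is to use a multiprocessing.Pool to run the CPU-intensive algorithm, with one queue feeding in lines (actually batches of lines) from the reader process, and another queue leading out of the pool and into the writer process: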

                       / [Pool] \    
  [Reader] --> InQueue --[Pool]---> OutQueue --> [Writer]
                       \ [Pool] /

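But how do I make sure the output lines (or batches) come out in the right order?

One simple answer is, "just write them to a temporary file, then sort the file and write it to the output". I might end up doing that, but I'd really like to start streaming output lines as soon as possible, rather than waiting for the entire input stream to be processed from start to finish.

I could easily write my own implementation of multiprocessing.Queue that sorts its items internally using a dictionary (or a circular buffer list), a lock, and two conditions (plus maybe an integer counter). However, I would need to get all of these objects from a Manager, and I'm afraid that using shared state like this between multiple processes would kill my performance. So, is there some appropriately Pythonic way to solve this?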

Answer 1

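Maybe I'm missing something, but it seems there is a basic answer to your question.

Let's take a simple example: we just want to reverse the lines of a text. Here are the lines we want to reverse: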

INPUT = ["line {}".format(i)[::-1] for i in range(30)]

That is:

['0 enil', '1 enil', '2 enil', ..., '92 enil']

Here is how to reverse these lines:

import random
import time

def process_line(line):
    time.sleep(1.5*random.random()) # simulation of a heavy computation
    return line[::-1]

The lines come from a source:

def source():
    for s in INPUT:
        time.sleep(0.5*random.random()) # simulation of the network latency
        yield s

We can use multiprocessing to speed things up:

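from multiprocessing import Pool

if __name__ == "__main__": # guard needed where worker processes are spawned (e.g. Windows, macOS)
    with Pool(3) as p:
        for line in p.imap_unordered(process_line, source()):
            print(line)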

But our lines do not come out in the expected order:

line 0
line 2
line 1
line 5
line 3
line 4
...
line 27
line 26
line 29
line 28
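
One thing worth checking before building anything by hand: `Pool.imap`, the ordered sibling of `imap_unordered`, yields its results in the same order as its input. Below is a minimal sketch that reuses the toy reversal without the sleeps; the helper name `process_in_order` and the `workers` and `chunksize` values are illustrative assumptions, not part of the original answer.

```python
from multiprocessing import Pool


def process_line(line):
    """Stand-in for the slow computation: reverse the line."""
    return line[::-1]


def process_in_order(lines, workers=3, chunksize=4):
    """Yield processed lines in input order (hypothetical helper)."""
    with Pool(workers) as pool:
        # Unlike imap_unordered, imap yields results in input order.
        yield from pool.imap(process_line, lines, chunksize)


if __name__ == "__main__":
    reversed_lines = ["line {}".format(i)[::-1] for i in range(10)]
    for line in process_in_order(reversed_lines):
        print(line)
```

The rest of this answer shows how to do the same reordering yourself, which can be useful when you want control over how early results are buffered.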

To get the lines in the expected order, you can:

index the lines,
process them, and
collect them in the expected order.

First, index the lines:

def wrapped_source():
    for i, s in enumerate(source()):
        yield i, s

Second, process each line, but keep its index:

def wrapped_process_line(args):
    i, line = args
    return i, process_line(line)
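
The question mentions feeding the pool batches of lines rather than single lines. The same two wrappers carry over if the index is attached to each batch instead of each line. The following is only a sketch under that assumption: `indexed_batches`, `process_batch` and `batch_size` are hypothetical names, and `process_line` is reduced to the bare reversal.

```python
from itertools import islice


def indexed_batches(lines, batch_size):
    """Yield (batch_index, list_of_lines) pairs from any iterable of lines."""
    it = iter(lines)
    index = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:  # the source is exhausted
            return
        yield index, batch
        index += 1


def process_line(line):
    """Stand-in for the slow computation: reverse the line."""
    return line[::-1]


def process_batch(args):
    """Process every line of a batch and keep the batch index."""
    index, batch = args
    return index, [process_line(line) for line in batch]
```

Larger batches mean fewer items crossing the process boundary, at the cost of a coarser granularity when reordering.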

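Third, collect the lines in the expected order. The idea is to use a counter and a heap. The counter is the expected index of the next line.

Take the next pair (index, processed line):

if the index is equal to the counter, just yield the processed line and increment the counter;
if not, store the pair (index, processed line) in the heap.

Then, while the smallest index in the heap is equal to the counter, pop the smallest element, yield its line and increment the counter.

Loop until the source is empty, then flush the heap.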

from heapq import heappush, heappop

if __name__ == "__main__": # guard needed where worker processes are spawned (e.g. Windows, macOS)
    h = [] # min-heap of (index, processed line) pairs that arrived too early

    with Pool(3) as p:
        expected_i = 0 # index of the next line to print
        for i, line in p.imap_unordered(wrapped_process_line, wrapped_source()):
            if i == expected_i: # lucky!
                print(line)
                expected_i += 1
            else: # unlucky!
                heappush(h, (i, line)) # store the processed line

            while h: # look for the line "expected_i" in the heap
                i_smallest, line_smallest = h[0] # the smallest element
                if i_smallest == expected_i:
                    heappop(h)
                    print(line_smallest)
                    expected_i += 1
                else:
                    break # the line "expected_i" was not processed yet.

        while h: # flush the heap
            print(heappop(h)[1]) # the line
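
The counter-and-heap logic does not depend on multiprocessing, so it can be packaged as a small generator and tried on plain lists. This is a sketch, not part of the original answer: the name `in_order` is hypothetical, and it always pushes onto the heap first, which gives the same output as the lucky/unlucky branches above.

```python
import heapq


def in_order(indexed_results):
    """Yield items in index order from (index, item) pairs arriving in any order."""
    heap = []        # pairs that arrived before their turn
    expected_i = 0   # index of the next item to yield
    for pair in indexed_results:
        heapq.heappush(heap, pair)
        # Release every buffered item whose turn has come.
        while heap and heap[0][0] == expected_i:
            yield heapq.heappop(heap)[1]
            expected_i += 1
    # Source exhausted: flush what is left (only non-empty if an index was missing).
    while heap:
        yield heapq.heappop(heap)[1]


print(list(in_order([(2, "c"), (0, "a"), (1, "b"), (4, "e"), (3, "d")])))
```

With a pool, the argument would be the iterator returned by `imap_unordered` on the indexed source.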

Now our lines are in the expected order:

line 0
line 1
line 2
line 3
line 4
line 5
...
line 26
line 27
line 28
line 29

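There is no extra delay: if the next expected line has not been processed yet, we have to wait, but as soon as that line arrives, we yield it.

The main drawback is that you have to handle potential gaps yourself (timeouts, new requests, ...): once you have indexed your lines, if one line is lost (for whatever reason), the loop will wait for that line until the source is exhausted, and only then flush the heap. In that case, you may run out of memory.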
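
One possible way to bound the memory used while waiting for a lost line is to cap the number of buffered results and give up on the awaited index once the cap is exceeded. The sketch below assumes that skipping a lost line is acceptable; `in_order_with_gap_limit` and `max_pending` are hypothetical names, and results that arrive after their index was skipped are simply dropped.

```python
import heapq


def in_order_with_gap_limit(indexed_results, max_pending=1000):
    """Yield (index, line) pairs in index order, skipping indexes that never show up."""
    heap = []        # results that arrived before their turn
    expected_i = 0   # index of the next result to yield
    for i, line in indexed_results:
        if i < expected_i:
            continue  # arrived after we gave up on it: drop it
        heapq.heappush(heap, (i, line))
        if len(heap) > max_pending:
            # Too many results are waiting: treat the missing indexes as lost
            # and resume from the smallest index we actually have.
            expected_i = heap[0][0]
        while heap and heap[0][0] == expected_i:
            yield heapq.heappop(heap)
            expected_i += 1
    # Source exhausted: emit whatever is left, still in index order.
    while heap:
        yield heapq.heappop(heap)
```

Yielding the index along with the line lets the consumer notice where a gap occurred and decide whether to retry or report it.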

Solution 1
0 2020-01-24 19:33:33
