multiprocessing.Pool.imap_unordered with fixed queue size or buffer?

Question

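I am reading data from large CSV files, processing it, and loading it into a SQLite database. Profiling suggests that 80% of the time is spent on I/O and 20% on processing the input to prepare it for database insertion. I sped up the processing step with multiprocessing.Pool so that the I/O code never has to wait for the next record. However, this causes serious memory problems because the I/O step cannot keep up with the workers.

The following toy example illustrates my problem: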

#!/usr/bin/env python  # 3.4.3
import time
from multiprocessing import Pool

def records(num=100):
    """Simulate generator getting data from large CSV files."""
    for i in range(num):
        print('Reading record {0}'.format(i))
        time.sleep(0.05)  # getting raw data is fast
        yield i

def process(rec):
    """Simulate processing of raw text into dicts."""
    print('Processing {0}'.format(rec))
    time.sleep(0.1)  # processing takes a little time
    return rec

def writer(records):
    """Simulate saving data to SQLite database."""
    for r in records:
        time.sleep(0.3)  # writing takes the longest
        print('Wrote {0}'.format(r))

if __name__ == "__main__":
    data = records(100)
    with Pool(2) as pool:
        writer(pool.imap_unordered(process, data, chunksize=5))

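This code produces a backlog of records that eventually consumes all memory, because I cannot persist the data to disk fast enough. Run the code and you will notice that Pool.imap_unordered has consumed all the data by the time writer is at about the 15th record. Now imagine the processing step producing dictionaries from hundreds of millions of rows, and you can see why I run out of memory. Amdahl's law in action, perhaps.

Is there a fix for this? I think I need some kind of buffered Pool.imap_unordered that says, "Once there are X records waiting to be inserted, stop and wait until there are fewer than X before producing more." I should still be able to gain some speed by preparing the next record while the last one is being saved.

I tried NuMap from the papy module (which I modified to work with Python 3) to do exactly this, but it was not faster. In fact, it was worse than running the program sequentially; NuMap uses two threads plus multiple processes.

SQLite's bulk import features are probably not suited to my task, because the data needs substantial processing and normalization.

I have about 85G of compressed text to process. I am open to other database technologies, but I picked SQLite for ease of use: this is a write-once, read-many job, and only 3 or 4 people will use the resulting database once everything is loaded.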

Answer 1

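While working on the same problem, I found that an effective way to keep the pool from being overloaded is to use a semaphore with a generator: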

from multiprocessing import Pool, Semaphore

def produce(semaphore, from_file):
    with open(from_file) as reader:
        for line in reader:
            # Reduce Semaphore by 1 or wait if 0
            semaphore.acquire()
            # Now deliver an item to the caller (pool)
            yield line

def process(item):
    result = (first_function(item),
              second_function(item),
              third_function(item))
    return result

def consume(semaphore, result):
    database_con.cur.execute("INSERT INTO ResultTable VALUES (?,?,?)", result)
    # Result is consumed, semaphore may now be increased by 1
    semaphore.release()

def main():
    global database_con
    semaphore_1 = Semaphore(1024)
    with Pool(2) as pool:
        for result in pool.imap_unordered(process, produce(semaphore_1, "workfile.txt"), chunksize=128):
            consume(semaphore_1, result)
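
The snippet above depends on names defined elsewhere (database_con, first_function and so on). The following is a minimal runnable sketch of the same idea, not the answerer's exact code. The names throttled, load and max_pending are hypothetical, and a list append stands in for the database insert. It assumes, as in CPython's implementation, that the pool consumes its input iterable in a helper thread of the parent process, so a threading.Semaphore is enough. ThreadPool is used in the example only because it shares Pool's interface and runs anywhere without a `__main__` guard.

```python
import threading
import time
from multiprocessing.pool import ThreadPool


def process(rec):
    """Stand-in for the CPU-side processing step."""
    time.sleep(0.001)
    return rec * 2


def load(pool, source, max_pending=8, chunksize=2):
    """Feed `source` through pool.imap_unordered with bounded backlog.

    Returns (results, peak), where peak is the largest observed number
    of records read from `source` but not yet "written".
    """
    # A chunk is only dispatched once it is full, so the semaphore must
    # allow at least one whole chunk or the pipeline would stall.
    if max_pending < chunksize:
        raise ValueError("max_pending must be >= chunksize")

    semaphore = threading.Semaphore(max_pending)
    stats = {"read": 0, "written": 0, "peak": 0}

    def throttled():
        for item in source:
            semaphore.acquire()  # blocks while max_pending records are in flight
            stats["read"] += 1
            stats["peak"] = max(stats["peak"], stats["read"] - stats["written"])
            yield item

    results = []
    for result in pool.imap_unordered(process, throttled(), chunksize=chunksize):
        results.append(result)  # stand-in for the slow database insert
        stats["written"] += 1
        semaphore.release()     # one record written, allow one more to be read
    return results, stats["peak"]
```

To try it, something like `with ThreadPool(2) as pool: results, peak = load(pool, range(50))` should work. With real worker processes, pass a multiprocessing.Pool created under `if __name__ == "__main__":` instead.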

See also:

K Hong - Multithreading - Semaphore objects and thread pool

Chris Terman lecture - MIT 6.004 L21: Semaphores

Answer 2

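Since processing is fast but writing is slow, it sounds like your problem is I/O-bound. There may therefore be little to gain from multiprocessing.

However, it is possible to peel off a chunk of data, process that chunk, and wait until it has been written before peeling off the next one: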

import itertools as IT
if __name__ == "__main__":
    data = records(100)
    with Pool(2) as pool:
        chunksize = ...
        for chunk in iter(lambda: list(IT.islice(data, chunksize)), []):
            writer(pool.imap_unordered(process, chunk, chunksize=5))
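
The loop above relies on the two-argument form of iter, which keeps calling the lambda until it returns the sentinel (here an empty list). Here is a small sketch of just that batching idiom, with a hypothetical helper name batches:

```python
import itertools


def batches(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    # iter(callable, sentinel) stops once the callable returns the sentinel,
    # which happens when islice has nothing left to take.
    return iter(lambda: list(itertools.islice(it, size)), [])


for batch in batches(range(7), 3):
    print(batch)
```

At most one batch of records is held in memory at a time, at the cost of the workers sitting idle while the tail of each batch is being written.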

Answer 3

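It sounds like all you really need is to replace the unbounded queues underneath the Pool with bounded (and blocking) queues. That way, if any side gets ahead of the rest, it simply blocks until the others are ready.

This should be easy to do by looking at the source and subclassing or monkeypatching Pool, something like: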

import multiprocessing.pool
import queue

class Pool(multiprocessing.pool.Pool):
    def _setup_queues(self):
        self._inqueue = self._ctx.Queue(5)
        self._outqueue = self._ctx.Queue(5)
        self._quick_put = self._inqueue._writer.send
        self._quick_get = self._outqueue._reader.recv
        self._taskqueue = queue.Queue(10)

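But this is obviously not portable (not even to CPython 3.3, let alone to a different Python 3 implementation).

I think you could do it portably in 3.4+ by providing a custom context, but I have not been able to get that right, so...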
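
The blocking behaviour described here can be seen without touching Pool's internals. The sketch below is not a patch for Pool. It is a plain producer/consumer pair with a hypothetical run_pipeline function, using the standard queue.Queue with a maxsize, and it only illustrates how a bounded queue applies back-pressure:

```python
import queue
import threading


def run_pipeline(n_items, maxsize):
    """Push n_items through a bounded queue; return (items, peak queue size)."""
    q = queue.Queue(maxsize=maxsize)

    def producer():
        for i in range(n_items):
            q.put(i)      # blocks while the queue already holds maxsize items
        q.put(None)       # sentinel: no more data

    thread = threading.Thread(target=producer)
    thread.start()

    received = []
    peak = 0
    while True:
        peak = max(peak, q.qsize())
        item = q.get()    # blocks while the queue is empty
        if item is None:
            break
        received.append(item)
    thread.join()
    return received, peak
```

However slow the consumer is, the producer can never get more than maxsize items ahead of it.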

Answer 4

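A simple workaround might be to use psutil to check memory usage in each process and, if more than, say, 90% of memory is taken, just sleep for a while (the snippet below uses a 75% threshold).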

import time
import psutil  # third-party package
# Pause while system-wide memory usage is above 75%.
while psutil.virtual_memory().percent > 75:
    time.sleep(1)
    print("process paused for 1 second!")
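
The answer does not say where this check should go. One possible placement, sketched here as an assumption, is inside the generator that feeds the pool. The helper name memory_throttled and its get_percent parameter are hypothetical; the memory reading is passed in as a callable so the sketch itself does not require psutil.

```python
import time


def memory_throttled(iterable, get_percent, threshold=75.0,
                     poll_seconds=1.0, sleep=time.sleep):
    """Yield items from `iterable`, pausing while memory usage is too high.

    `get_percent` is any zero-argument callable returning the current
    memory usage as a percentage.
    """
    for item in iterable:
        # Hold back the next record until usage drops to the threshold.
        while get_percent() > threshold:
            sleep(poll_seconds)
        yield item
```

With psutil installed, `memory_throttled(records(100), lambda: psutil.virtual_memory().percent)` could be handed to pool.imap_unordered in place of the raw generator. Unlike a semaphore or a bounded queue, this only reacts once memory is already under pressure.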

Answer 1: 4, 2017-11-01 15:33:58

Answer 2: 2, accepted, 2015-05-26 02:37:44

Answer 3: 1, 2015-05-26 02:12:39

Answer 4: 1, 2019-03-25 03:34:16
