Multiprocessing in Python: is there a way to use pool.imap without accumulating memory?

Question

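I am using the multiprocessing module in Python to train keras neural networks in parallel, using a Pool(processes = 4) object with imap. Memory usage grows steadily after every "cycle", i.e. after every 4 processes, until the program finally crashes.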

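I used the memory_profiler module to track my memory usage over time while training 12 networks. This is with vanilla imap:

[memory usage plot: vanilla imap]

If I put maxtasksperchild = 1 into Pool:

[memory usage plot: 1taskperchild]

If I use imap(chunksize = 3):

[memory usage plot: chunksize = 3]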

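In the latter case, where everything works fine, I only send a single batch to each process in the pool, so the problem seems to be that the processes carry information about previous batches. If so, can I force the pool not to do that?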

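Even though the chunk solution seems to work, I would rather not use it, because:

I would like to track progress with the tqdm module, and with chunks it only updates after each chunk, which effectively means it does not track anything at all, since all the chunks finish at the same time (in this example).
At the moment all networks take exactly the same time to train, but I would like to allow for different training times per network, in which case the chunk solution could leave one process with all the long-running trainings.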
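To make the second concern concrete, here is a minimal sketch of how chunks are assigned. It is not the training code from the question: which_worker and pids_per_chunk are hypothetical helpers that only record which worker PID handled each task. Since a chunk is handed to a single worker as one unit, all tasks in a chunk should report the same PID.

```python
import os
from multiprocessing import Pool


def which_worker(task_id):
    """Hypothetical stand-in task: report the PID of the worker that ran it."""
    return task_id, os.getpid()


def pids_per_chunk(n_tasks=12, n_procs=4, chunksize=3):
    """Run n_tasks through imap and group the worker PIDs chunk by chunk."""
    with Pool(processes=n_procs) as pool:
        results = list(pool.imap(which_worker, range(n_tasks), chunksize=chunksize))
    pids = [pid for _, pid in results]
    # imap preserves input order, so consecutive slices correspond to chunks
    return [pids[i:i + chunksize] for i in range(0, n_tasks, chunksize)]


if __name__ == "__main__":
    for chunk in pids_per_chunk():
        print(chunk)  # each printed list should contain a single repeated PID
```

So if the three slowest networks happen to sit next to each other in the input, they all end up in the same worker.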

Here is the code snippet for the vanilla case. In the other two cases I only changed the maxtasksperchild parameter of Pool and the chunksize parameter of imap:

from multiprocessing import Pool
from tqdm import tqdm

def train_network(network):
    (...)
    return score

# Four worker processes; imap yields the scores lazily, in input order
pool = Pool(processes=4)
scores = pool.imap(train_network, networks)
scores = tqdm(scores, total=networks.size)

for (network, score) in zip(networks, scores):
    network.score = score

pool.close()
pool.join()
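For reference, a small sketch of what maxtasksperchild changes, under the assumption that the per-task work can be replaced by a stub. train_stub and distinct_workers are hypothetical names, not part of the question's code. With maxtasksperchild=1 each worker exits after one task and the pool starts a fresh process, so more distinct PIDs show up than there are pool slots. Note that when chunksize is larger than 1, one whole chunk counts as one task for this limit.

```python
import os
from multiprocessing import Pool


def train_stub(task_id):
    """Hypothetical stand-in for train_network: return the worker's PID."""
    return os.getpid()


def distinct_workers(maxtasksperchild, n_tasks=8, n_procs=4):
    """Count how many different worker processes handled n_tasks tasks."""
    with Pool(processes=n_procs, maxtasksperchild=maxtasksperchild) as pool:
        pids = list(pool.imap(train_stub, range(n_tasks)))
    return len(set(pids))


if __name__ == "__main__":
    print("reused workers:", distinct_workers(None))  # at most n_procs PIDs
    print("fresh workers: ", distinct_workers(1))     # typically one PID per task
```

Whatever a worker accumulated in memory goes away when that process exits, which is why this setting is often tried for leaks like the one described.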

Answer 1

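Unfortunately, the multiprocessing module in Python comes at a considerable cost. Data is mostly not shared between processes and has to be copied. This will change starting with Python 3.8.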

https://docs.python.org/3.8/library/multiprocessing.shared_memory.html

Although the official release of Python 3.8 is scheduled for 21 October 2019, you can already download it from GitHub.
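A minimal sketch of the module linked above, assuming Python 3.8 or newer. roundtrip is a hypothetical helper, and for simplicity both handles live in the same process here; in real use the second handle would be opened by name in a different process. The payload must be non-empty.

```python
from multiprocessing import shared_memory


def roundtrip(payload: bytes) -> bytes:
    """Write payload into a shared block and read it back via a second handle."""
    # Create a new named block; the OS may round its size up to a page multiple
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        shm.buf[:len(payload)] = payload
        # Attach by name, as another process would, without copying the data
        other = shared_memory.SharedMemory(name=shm.name)
        try:
            data = bytes(other.buf[:len(payload)])
        finally:
            other.close()
    finally:
        shm.close()
        shm.unlink()  # free the block once no process needs it any more
    return data


if __name__ == "__main__":
    print(roundtrip(b"weights"))
```

This only covers raw bytes; whether it helps with keras models depends on whether the large data can be laid out in such a buffer.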

Answer 2

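I came up with a solution that seems to work. I ditched the pool and built my own simple queueing system. Besides no longer growing (it does grow ever so slightly, but I think that is because I store some dictionaries as a log), it even consumes less memory than the chunk solution above: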

[memory usage plot: queue-based approach]

I have no idea why this is the case. Perhaps the Pool object simply takes up a lot of memory? Anyway, here is my code:

import multiprocessing as mp
from tqdm import tqdm

def train_network(network):
    (...)
    return score

# Define queues to organise the parallelising
todo = mp.Queue(maxsize=networks.size + 4)
done = mp.Queue(maxsize=networks.size)

# Populate the todo queue
for idx in range(networks.size):
    todo.put(idx)

# Add -1's which will be an effective way of checking
# if all todo's are finished
for _ in range(4):
    todo.put(-1)

def worker(todo, done):
    ''' Network scoring worker. '''
    from queue import Empty
    while True:
        try:
            # Fetch the next todo
            idx = todo.get(timeout = 1)
        except Empty:
            # The queue is never empty, so the silly worker has to go
            # back and try again
            continue

        # If we have reached a -1 then stop
        if idx == -1:
            break
        else:
            # Score the network and store it in the done queue
            score = train_network(networks[idx])
            done.put((idx, score))

# Construct our four processes
processes = [mp.Process(target = worker,
    args = (todo, done)) for _ in range(4)]

# Mark the processes as daemons, so they are terminated
# if the main process exits, and start them
for p in processes:
    p.daemon = True
    p.start()

# Set up the iterable with all the scores, and set
# up a progress bar
idx_scores = (done.get() for _ in networks)
pbar = tqdm(idx_scores, total = networks.size)

# Compute all the scores in parallel
for (idx, score) in pbar:
    networks[idx].score = score

# Join up the processes and close the progress bar
for p in processes:
    p.join()
pbar.close()
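For anyone who wants to try the pattern without keras, here is a self-contained sketch of the same todo/done queue idea. It is a simplified variant, not the code above: score_stub is a hypothetical stand-in for train_network, and the worker uses a blocking get() instead of the timeout loop, on the assumption that one sentinel per worker is always enqueued.

```python
import multiprocessing as mp

SENTINEL = -1  # same stop marker as in the code above


def score_stub(idx):
    """Hypothetical stand-in for train_network(networks[idx])."""
    return idx * idx


def worker(todo, done):
    # iter(callable, sentinel) keeps calling todo.get() until it returns SENTINEL
    for idx in iter(todo.get, SENTINEL):
        done.put((idx, score_stub(idx)))


def run(n_tasks=12, n_procs=4):
    todo, done = mp.Queue(), mp.Queue()
    for idx in range(n_tasks):
        todo.put(idx)
    for _ in range(n_procs):
        todo.put(SENTINEL)  # one stop marker per worker

    procs = [mp.Process(target=worker, args=(todo, done)) for _ in range(n_procs)]
    for p in procs:
        p.start()

    # Collect every result before joining, so no worker is left blocked on put()
    scores = dict(done.get() for _ in range(n_tasks))
    for p in procs:
        p.join()
    return scores


if __name__ == "__main__":
    print(run())
```

Because results arrive one at a time in completion order, wrapping the collection loop in tqdm still gives a per-network progress bar.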

Answer 1 (1) 2019-09-02 15:29:15
Answer 2 (1, accepted) 2019-09-02 19:39:17
