
How can I prevent multiprocessing.pool from consuming all of my memory?

My multiprocessing pool (8 cores, 16 GB RAM) is using all of my memory before ingesting much data. I am operating on a 6 GB dataset.

I have tried various Pool methods, including imap, imap_unordered, apply, map, etc. I have also tried maxtasksperchild, which seems to increase memory usage.
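For reference, maxtasksperchild is an argument to the Pool constructor rather than to the map call itself; a minimal sketch of that usage (the worker function and the per-child task limit here are placeholders, not taken from the real script):

import multiprocessing as mp

def work(line):
    # Placeholder worker; the real one does regex-based cleanup.
    return line.lower().split()

if __name__ == '__main__':
    # Each worker process is recycled after 1000 tasks (illustrative value).
    # This can cap memory leaks inside workers, but it does not bound the
    # size of any result list the parent process builds up.
    with mp.Pool(processes=8, maxtasksperchild=1000) as pool:
        for tokens in pool.imap_unordered(work, ["Some line\n", "Another line\n"]):
            print(tokens)

The full script that runs out of memory is below.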

import re
import multiprocessing as mp
from tqdm import tqdm

linkregex = re.compile(r"http\S+")
puncregex = re.compile(r"(?<=\w)[^\s\w](?![^\s\w])")
emojiregex = re.compile(r"(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])")


def process(item):
    # Replace links with "link", strip punctuation, pad emoji with spaces,
    # then lowercase and split into tokens.
    return re.sub(emojiregex, r" \1 ", re.sub(puncregex, "", re.sub(linkregex, "link", item))).lower().split()


if __name__ == '__main__':
    with mp.Pool(8) as pool:
        # list() collects every processed sentence in the parent process at once.
        sentences = list(tqdm(pool.imap_unordered(process, open('scrape/output.txt')),
                              total=52123146))

    print(len(sentences))
    with open("final/word2vectweets.txt", "a+") as out:
        out.write("\n".join(" ".join(s) for s in sentences))

This should return a list of processed lines from the file, but it consumes memory too fast: list() keeps all ~52 million tokenized sentences in the parent process's RAM at once, even though imap_unordered streams its inputs. Previous versions without multiprocessing and with simpler processing worked fine.

How does this look?

import re
import multiprocessing as mp

linkregex = re.compile(r"http\S+")
puncregex = re.compile(r"(?<=\w)[^\s\w](?![^\s\w])")
emojiregex = re.compile(r"(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])")


def process(item):
    # Same cleanup as before: links, punctuation, emoji, lowercase, tokenize.
    return re.sub(emojiregex, r" \1 ", re.sub(puncregex, "", re.sub(linkregex, "link", item))).lower().split()


if __name__ == '__main__':
    in_file_path = 'scrape/output.txt'
    out_file_path = 'final/word2vectweets.txt'

    with mp.Pool() as pool, open(in_file_path, 'r') as file_in, open(out_file_path, 'a') as file_out:
        # Write each result as soon as it arrives instead of building a giant list.
        for curr_sentence in pool.imap_unordered(process, file_in, chunksize=1000):
            file_out.write(' '.join(curr_sentence) + '\n')

I tested a bunch of chunk sizes; 1000 seems to be the sweet spot. I will keep investigating.
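One rough way to compare chunk sizes is to time a fixed sample of the file with each candidate value; the sketch below assumes the same input file, times only the first 1,000,000 lines, and measures wall-clock time rather than memory:

import itertools
import re
import time
import multiprocessing as mp

linkregex = re.compile(r"http\S+")
puncregex = re.compile(r"(?<=\w)[^\s\w](?![^\s\w])")
emojiregex = re.compile(r"(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])")


def process(item):
    return re.sub(emojiregex, r" \1 ", re.sub(puncregex, "", re.sub(linkregex, "link", item))).lower().split()


if __name__ == '__main__':
    for chunksize in (1, 100, 1000, 10000):
        with mp.Pool() as pool, open('scrape/output.txt') as file_in:
            start = time.perf_counter()
            # Consume results without storing them, over a fixed-size sample.
            sample = itertools.islice(file_in, 1_000_000)
            for _ in pool.imap_unordered(process, sample, chunksize=chunksize):
                pass
            print(chunksize, round(time.perf_counter() - start, 2), 'seconds')

Larger chunks amortize the inter-process communication overhead but balance the load less evenly, which is why there is a sweet spot somewhere in the middle.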
