
How can I prevent multiprocessing.pool from consuming all of my memory?

My multiprocessing pool (8 cores, 16 GB RAM) is using all of my memory before ingesting much data. I am operating on a 6 GB dataset.

I have tried various Pool methods, including imap, imap_unordered, apply, map, etc. I have also tried maxtasksperchild, which seems to increase memory usage.
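For reference, maxtasksperchild is an argument to the Pool constructor rather than to the map call itself; a minimal sketch of that usage (the worker function and the per-child task limit here are placeholders, not taken from the real script):

import multiprocessing as mp

def work(line):
    # Placeholder worker; the real one does regex-based cleanup.
    return line.lower().split()

if __name__ == '__main__':
    # Each worker process is recycled after 1000 tasks (illustrative value).
    # This can cap memory leaks inside workers, but it does not bound the
    # size of any result list the parent process builds up.
    with mp.Pool(processes=8, maxtasksperchild=1000) as pool:
        for tokens in pool.imap_unordered(work, ["Some line\n", "Another line\n"]):
            print(tokens)

The full script that runs out of memory is below.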

import re
import multiprocessing as mp
from tqdm import tqdm

linkregex = re.compile(r"http\S+")
puncregex = re.compile(r"(?<=\w)[^\s\w](?![^\s\w])")
emojiregex = re.compile(r"(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])")


def process(item):
    # Replace links with "link", strip punctuation, pad emoji with spaces,
    # then lowercase and split into tokens.
    return re.sub(emojiregex, r" \1 ", re.sub(puncregex, "", re.sub(linkregex, "link", item))).lower().split()


if __name__ == '__main__':
    with mp.Pool(8) as pool:
        # list() collects every processed sentence in the parent process at once.
        sentences = list(tqdm(pool.imap_unordered(process, open('scrape/output.txt')),
                              total=52123146))

    print(len(sentences))
    with open("final/word2vectweets.txt", "a+") as out:
        out.write("\n".join(" ".join(s) for s in sentences))

This should return a list of processed lines from the file, but it consumes memory too fast: list() keeps all ~52 million tokenized sentences in the parent process's RAM at once, even though imap_unordered streams its inputs. Previous versions without multiprocessing and with simpler processing worked fine.

How does this look?

import re
import multiprocessing as mp

linkregex = re.compile(r"http\S+")
puncregex = re.compile(r"(?<=\w)[^\s\w](?![^\s\w])")
emojiregex = re.compile(r"(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])")


def process(item):
    # Same cleanup as before: links, punctuation, emoji, lowercase, tokenize.
    return re.sub(emojiregex, r" \1 ", re.sub(puncregex, "", re.sub(linkregex, "link", item))).lower().split()


if __name__ == '__main__':
    in_file_path = 'scrape/output.txt'
    out_file_path = 'final/word2vectweets.txt'

    with mp.Pool() as pool, open(in_file_path, 'r') as file_in, open(out_file_path, 'a') as file_out:
        # Write each result as soon as it arrives instead of building a giant list.
        for curr_sentence in pool.imap_unordered(process, file_in, chunksize=1000):
            file_out.write(' '.join(curr_sentence) + '\n')

I tested a bunch of chunk sizes; 1000 seems to be the sweet spot. I will keep investigating.
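One rough way to compare chunk sizes is to time a fixed sample of the file with each candidate value; the sketch below assumes the same input file, times only the first 1,000,000 lines, and measures wall-clock time rather than memory:

import itertools
import re
import time
import multiprocessing as mp

linkregex = re.compile(r"http\S+")
puncregex = re.compile(r"(?<=\w)[^\s\w](?![^\s\w])")
emojiregex = re.compile(r"(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])")


def process(item):
    return re.sub(emojiregex, r" \1 ", re.sub(puncregex, "", re.sub(linkregex, "link", item))).lower().split()


if __name__ == '__main__':
    for chunksize in (1, 100, 1000, 10000):
        with mp.Pool() as pool, open('scrape/output.txt') as file_in:
            start = time.perf_counter()
            # Consume results without storing them, over a fixed-size sample.
            sample = itertools.islice(file_in, 1_000_000)
            for _ in pool.imap_unordered(process, sample, chunksize=chunksize):
                pass
            print(chunksize, round(time.perf_counter() - start, 2), 'seconds')

Larger chunks amortize the inter-process communication overhead but balance the load less evenly, which is why there is a sweet spot somewhere in the middle.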
