How to initialize a pool of python multiprocessing workers with a shared state?
I am trying to run some machine learning algorithms in parallel. When I use multiprocessing, it is slower than without it. My wild guess is that pickle serialization of the models I use slows everything down. So the question is: how can I initialize the pool's workers with an initial state, so that the models do not have to be serialized/deserialized on every call?

Here is my current code:
import pickle
from pathlib import Path
from collections import Counter
from multiprocessing import Pool

from gensim.models.doc2vec import Doc2Vec
from wikimark import html2paragraph
from wikimark import tokenize


def process(args):
    doc2vec, regressions, filepath = args
    with filepath.open('r') as f:
        string = f.read()
    subcategories = Counter()
    for index, paragraph in enumerate(html2paragraph(string)):
        tokens = tokenize(paragraph)
        vector = doc2vec.infer_vector(tokens)
        for subcategory, model in regressions.items():
            prediction = model.predict([vector])[0]
            subcategories[subcategory] += prediction
    # compute the mean score for each subcategory
    for subcategory, prediction in subcategories.items():
        subcategories[subcategory] = prediction / (index + 1)
    # keep only the main category
    subcategory = subcategories.most_common(1)[0]
    return (filepath, subcategory)


def main():
    input = Path('./build')

    doc2vec = Doc2Vec.load(str(input / 'model.doc2vec.gz'))

    regressions = dict()
    for filepath in input.glob('./*/*/*.model'):
        with filepath.open('rb') as f:
            model = pickle.load(f)
        regressions[filepath.parent] = model

    examples = list(input.glob('../data/wikipedia/english/*'))

    with Pool() as pool:
        iterable = zip(
            [doc2vec] * len(examples),  # XXX!
            [regressions] * len(examples),  # XXX!
            examples,
        )
        for filepath, subcategory in pool.imap_unordered(process, iterable):
            print('* {} -> {}'.format(filepath, subcategory))


if __name__ == '__main__':
    main()
The lines marked with XXX! point at the data that is serialized each time I call pool.imap_unordered. At least 200MB of data is serialized. How can I avoid the serialization?
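One quick way to sanity-check the guess about serialization cost is to pickle a single argument tuple yourself and look at its size. The objects below are small illustrative stand-ins for the real doc2vec model and regressions dict, not the question's actual data:

```python
import pickle

# Stand-ins for the real doc2vec model and regressions dict; the point
# is only to measure the pickled payload of one task tuple.
doc2vec = {"vectors": list(range(100000))}
regressions = {"subcat%d" % i: list(range(1000)) for i in range(10)}

# One task tuple, as built by zip(...) in main(); multiprocessing
# pickles this whole thing for EVERY item handed to imap_unordered.
args = (doc2vec, regressions, "build/page.html")
payload = pickle.dumps(args)
print(len(payload))  # bytes shipped to a worker for a single call
```

Multiplying that per-call payload by the number of examples shows why the run gets slower than the serial version.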
The solution is rather simple: use globals for doc2vec and regressions.
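A minimal, self-contained sketch of that global-state pattern. Here classify, init_worker, and the toy state dict are illustrative stand-ins for the question's main(), doc2vec, and regressions:

```python
from multiprocessing import Pool

# Worker-side global.  Under the default "fork" start method on Linux,
# anything loaded in the parent before Pool() is created is inherited
# copy-on-write, so it is never pickled per task.  The initializer
# below also covers the "spawn" start method (Windows/macOS default)
# by shipping the state once per worker instead of once per call.
regressions = None

def init_worker(state):
    # Runs exactly once in each worker process.
    global regressions
    regressions = state

def process(filepath):
    # Only `filepath` crosses the process boundary per call; the heavy
    # state is read from the module-level global set by init_worker.
    return (filepath, sorted(regressions))

def classify(files, state):
    with Pool(initializer=init_worker, initargs=(state,)) as pool:
        return dict(pool.imap_unordered(process, files))
```

With this shape, each task argument is just a file path; the 200MB of models is transferred (or inherited) once per worker process rather than once per imap_unordered item.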