How to initialize a pool of python multiprocessing workers with a shared state?
I am trying to run some machine learning algorithms in parallel. When I use multiprocessing, it is slower than without it. My wild guess is that pickle serialization of the models I use slows everything down. So the question is: how can I initialize the pool's workers with an initial state, so that the models do not have to be serialized/deserialized on every call?

Here is my current code:
import pickle
from pathlib import Path
from collections import Counter
from multiprocessing import Pool

from gensim.models.doc2vec import Doc2Vec
from wikimark import html2paragraph
from wikimark import tokenize


def process(args):
    doc2vec, regressions, filepath = args
    with filepath.open('r') as f:
        string = f.read()
    subcategories = Counter()
    for index, paragraph in enumerate(html2paragraph(string)):
        tokens = tokenize(paragraph)
        vector = doc2vec.infer_vector(tokens)
        for subcategory, model in regressions.items():
            prediction = model.predict([vector])[0]
            subcategories[subcategory] += prediction
    # compute the mean score for each subcategory
    for subcategory, prediction in subcategories.items():
        subcategories[subcategory] = prediction / (index + 1)
    # keep only the main category
    subcategory = subcategories.most_common(1)[0]
    return (filepath, subcategory)


def main():
    input = Path('./build')

    doc2vec = Doc2Vec.load(str(input / 'model.doc2vec.gz'))

    regressions = dict()
    for filepath in input.glob('./*/*/*.model'):
        with filepath.open('rb') as f:
            model = pickle.load(f)
        regressions[filepath.parent] = model

    examples = list(input.glob('../data/wikipedia/english/*'))

    with Pool() as pool:
        iterable = zip(
            [doc2vec] * len(examples),  # XXX!
            [regressions] * len(examples),  # XXX!
            examples,
        )
        for filepath, subcategory in pool.imap_unordered(process, iterable):
            print('* {} -> {}'.format(filepath, subcategory))


if __name__ == '__main__':
    main()
The lines marked with XXX! point at the data that is serialized each time I call pool.imap_unordered. At least 200MB of data is serialized. How can I avoid the serialization?
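One quick way to sanity-check the guess about serialization cost is to pickle a single argument tuple yourself and look at its size. The objects below are small illustrative stand-ins for the real doc2vec model and regressions dict, not the question's actual data:

```python
import pickle

# Stand-ins for the real doc2vec model and regressions dict; the point
# is only to measure the pickled payload of one task tuple.
doc2vec = {"vectors": list(range(100000))}
regressions = {"subcat%d" % i: list(range(1000)) for i in range(10)}

# One task tuple, as built by zip(...) in main(); multiprocessing
# pickles this whole thing for EVERY item handed to imap_unordered.
args = (doc2vec, regressions, "build/page.html")
payload = pickle.dumps(args)
print(len(payload))  # bytes shipped to a worker for a single call
```

Multiplying that per-call payload by the number of examples shows why the run gets slower than the serial version.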
The solution is rather simple: use globals for doc2vec and regressions.
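A minimal, self-contained sketch of that global-state pattern. Here classify, init_worker, and the toy state dict are illustrative stand-ins for the question's main(), doc2vec, and regressions:

```python
from multiprocessing import Pool

# Worker-side global.  Under the default "fork" start method on Linux,
# anything loaded in the parent before Pool() is created is inherited
# copy-on-write, so it is never pickled per task.  The initializer
# below also covers the "spawn" start method (Windows/macOS default)
# by shipping the state once per worker instead of once per call.
regressions = None

def init_worker(state):
    # Runs exactly once in each worker process.
    global regressions
    regressions = state

def process(filepath):
    # Only `filepath` crosses the process boundary per call; the heavy
    # state is read from the module-level global set by init_worker.
    return (filepath, sorted(regressions))

def classify(files, state):
    with Pool(initializer=init_worker, initargs=(state,)) as pool:
        return dict(pool.imap_unordered(process, files))
```

With this shape, each task argument is just a file path; the 200MB of models is transferred (or inherited) once per worker process rather than once per imap_unordered item.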