multiprocessing with textacy or spacy

I am trying to speed up processing of large lists of texts via parallelisation of textacy. When I use Pool from multiprocessing, the resulting textacy corpus comes out empty. I am not sure if the problem is in the way I use textacy or in the multiprocessing paradigm. Here is an example that illustrates my issue:

import spacy
import textacy
from multiprocessing import Pool

texts_dict = {
    'key1': 'First text 1.',
    'key2': 'Second text 2.',
    'key3': 'Third text 3.',
    'key4': 'Fourth text 4.',
}

model=spacy.load('en_core_web_lg')

# this works

corpus = textacy.corpus.Corpus(lang=model)

corpus.add((value, {'key': key}) for key, value in texts_dict.items())

print(corpus) # prints Corpus(4 docs, 8 tokens)
print([doc for doc in corpus])

# now the same thing with a worker pool returns empty corpus

corpus2 = textacy.corpus.Corpus(lang=model)

pool = Pool(processes=2)
pool.map(corpus2.add, ((value, {'key': key}) for key, value in texts_dict.items()))

print(corpus2) # prints Corpus(0 docs, 0 tokens)
print([doc for doc in corpus2])

# to make sure we pass the right data into corpus2.add
pool.map(print, ((value, {'key': key}) for key, value in texts_dict.items()))

Textacy is based on spaCy. spaCy doesn't support multithreading, but supposedly should be OK to run in multiple processes: https://github.com/explosion/spaCy/issues/2075
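
For reference, spaCy can also do the process-level parallelism itself. A minimal sketch, assuming spaCy >= 2.2.2 (which added the n_process argument to Language.pipe) and assuming this textacy version registers the doc._.meta extension on import; both assumptions are worth checking against your installed versions:

import spacy
import textacy

# reuses texts_dict from the snippet above
model = spacy.load('en_core_web_lg')
corpus = textacy.corpus.Corpus(lang=model)

# as_tuples=True threads each dict key through the pipeline as context,
# while spaCy spreads the parsing itself over two worker processes
pairs = ((text, key) for key, text in texts_dict.items())
for doc, key in model.pipe(pairs, as_tuples=True, n_process=2):
    doc._.meta = {'key': key}  # assumption: textacy's meta extension is registered
    corpus.add(doc)

print(corpus)  # docs are collected serially in the parent, so none are lost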

As per the great suggestion of @constt ( https://stackoverflow.com/a/58317741/4634344 ), collecting the results into the corpus works for a dataset as large as n_docs=10273, n_sentences=302510, n_tokens=2053129.

For a larger dataset (16000 docs, 3MM tokens) I get the following error:

result_corpus=corpus.get() 
  File "<string>", line 2, in get
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 283, in serve_client
    send(msg)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

I will investigate, but if you have a direct solution - much appreciated!

Because Python processes run in separate memory spaces, you have to share your corpus object between the processes in the pool. To do this, wrap the corpus object in a sharable class which you register with a BaseManager class. Here is how you can refactor your code to make it work:

#!/usr/bin/python3
from multiprocessing import Pool
from multiprocessing.managers import BaseManager

import spacy
import textacy


texts = {
    'key1': 'First text 1.',
    'key2': 'Second text 2.',
    'key3': 'Third text 3.',
    'key4': 'Fourth text 4.',
}


class PoolCorpus(object):

    def __init__(self):
        # the model and the corpus live inside the manager's server process
        model = spacy.load('en_core_web_sm')
        self.corpus = textacy.corpus.Corpus(lang=model)

    def add(self, data):
        # called by pool workers through the manager proxy
        self.corpus.add(data)

    def get(self):
        # returns the assembled corpus to the parent process
        return self.corpus


BaseManager.register('PoolCorpus', PoolCorpus)


if __name__ == '__main__':

    with BaseManager() as manager:
        corpus = manager.PoolCorpus()

        with Pool(processes=2) as pool:
            pool.map(corpus.add, ((v, {'key': k}) for k, v in texts.items()))

        print(corpus.get())

Output:

Corpus(4 docs, 16 tokens)
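
On the struct.error from the question's edit: the manager pickles the entire corpus into a single reply, and in Python 3.6 a single connection message must fit a signed 32-bit length header, i.e. roughly 2 GiB. A hedged sketch of one possible direction, untested at this scale: fetch documents from the manager in fixed-size batches so that no single reply hits the limit. Note that n_docs and get_batch here are hypothetical additions for illustration, not textacy API:

#!/usr/bin/python3
from multiprocessing import Pool
from multiprocessing.managers import BaseManager

import spacy
import textacy

texts = {
    'key1': 'First text 1.',
    'key2': 'Second text 2.',
    'key3': 'Third text 3.',
    'key4': 'Fourth text 4.',
}


class PoolCorpus(object):

    def __init__(self):
        model = spacy.load('en_core_web_sm')
        self.corpus = textacy.corpus.Corpus(lang=model)

    def add(self, data):
        self.corpus.add(data)

    def n_docs(self):
        # hypothetical helper: lets the parent ask how many docs to fetch
        return self.corpus.n_docs

    def get_batch(self, start, size):
        # hypothetical helper: each spaCy Doc can be pickled on its own,
        # so a modest batch keeps every reply far below the 2 GiB limit
        return list(self.corpus)[start:start + size]


BaseManager.register('PoolCorpus', PoolCorpus)


if __name__ == '__main__':

    with BaseManager() as manager:
        corpus = manager.PoolCorpus()

        with Pool(processes=2) as pool:
            pool.map(corpus.add, ((v, {'key': k}) for k, v in texts.items()))

        docs = []
        batch_size = 2  # use a few hundred for real data
        for start in range(0, corpus.n_docs(), batch_size):
            docs.extend(corpus.get_batch(start, batch_size))

        print(len(docs))  # 4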
