
Multiprocessing with textacy or spacy

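I am trying to speed up the processing of a large number of texts by parallelizing it. When I use Pool from multiprocessing, the resulting corpus is empty. I am not sure whether the problem lies in how I use textacy or in how I use the multiprocessing paradigm. Here is an example that illustrates my problem: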

import spacy
import textacy
from multiprocessing import Pool

texts_dict={
"key1":"First text 1."
,"key2":"Second text 2."
,"key3":"Third text 3."
,"key4":"Fourth text 4."
}

model=spacy.load('en_core_web_lg')

# this works

corpus = textacy.corpus.Corpus(lang=model)

corpus.add(tuple([value, {'key':key}],) for key,value in texts_dict.items())

print(corpus) # prints Corpus(4 docs, 8 tokens)
print([doc for doc in corpus])

# now the same thing with a worker pool returns empty corpus

corpus2 = textacy.corpus.Corpus(lang=model)

pool = Pool(processes=2) 
pool.map( corpus2.add, (tuple([value, {'key':key}],) for key,value in texts_dict.items()) )

print(corpus2) # prints Corpus(0 docs, 0 tokens)
print([doc for doc in corpus2])

# to make sure we get the right data into corpus.add
pool.map( print, (tuple([value, {'key':key}],) for key,value in texts_dict.items()) )

Textacy is built on spacy. Spacy does not support multithreading, but it should be possible to run it in multiple processes. https://github.com/explosion/spaCy/issues/2075
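One way to see why the pooled version leaves corpus2 empty: Pool pickles the callable it is given, together with the object that callable is bound to, and each worker unpickles its own copy. The sketch below mimics that round trip with pickle alone, with no processes and no spaCy. A plain list stands in for corpus2 and list.append for corpus2.add; the helper name send_to_worker is made up for this illustration.

```python
import pickle


def send_to_worker(bound_method):
    """Mimic what Pool does to a callable: pickle it, then unpickle it elsewhere."""
    return pickle.loads(pickle.dumps(bound_method))


# Stand-in for corpus2 in the parent process (assumption: a list instead of a Corpus).
parent_docs = []

# Stand-in for corpus2.add as a worker would see it after unpickling.
worker_add = send_to_worker(parent_docs.append)
worker_add("First text 1.")

print(parent_docs)          # the parent's list is untouched
print(worker_add.__self__)  # the text went into the worker-side copy
```

The write lands in the copy, which is the same effect as corpus2 reporting 0 docs after pool.map returns.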

Following the great suggestion from @constt https://stackoverflow.com/a/58317741/4634344, collecting the results into the corpus works for a dataset of n_docs=10273, n_sentences=302510, n_tokens=2053129.

For a larger dataset (16,000 docs, 3MM tokens) I get the following error:

result_corpus=corpus.get() 
  File "<string>", line 2, in get
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 283, in serve_client
    send(msg)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

I will investigate, but if you have a direct solution, it would be much appreciated!
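The last frames of the traceback point at the cause: on the Python 3.6 shown there, multiprocessing packs the length of each pickled message with struct.pack("!i", n), a signed 32-bit integer. A single message, here presumably the whole pickled corpus returned by corpus.get(), therefore has to stay below 2 GiB. The snippet below only reproduces that size check; the helper name header_fits is made up for illustration. I believe later Python releases (3.8 onward) accept larger messages, but treat that as an assumption to verify.

```python
import struct

MAX_INT32 = 2**31 - 1  # 2147483647, the upper bound quoted in the error message


def header_fits(n_bytes):
    """Return True if a payload length can be packed as a signed 32-bit header."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        return False


print(header_fits(MAX_INT32))      # True
print(header_fits(MAX_INT32 + 1))  # False
```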

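Because Python processes run in separate memory spaces, you have to share your corpus object between the processes in the pool. To do this, wrap the corpus object in a shareable class and register that class with BaseManager. Here is how you can refactor your code to make it work: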

#!/usr/bin/python3
from multiprocessing import Pool
from multiprocessing.managers import BaseManager

import spacy
import textacy


texts = {
    'key1': 'First text 1.',
    'key2': 'Second text 2.',
    'key3': 'Third text 3.',
    'key4': 'Fourth text 4.',
}


class PoolCorpus(object):

    def __init__(self):
        model = spacy.load('en_core_web_sm')
        self.corpus = textacy.corpus.Corpus(lang=model)

    def add(self, data):
        self.corpus.add(data)

    def get(self):
        return self.corpus


BaseManager.register('PoolCorpus', PoolCorpus)


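if __name__ == '__main__':

    # Start the manager's server process; it owns the single real PoolCorpus.
    with BaseManager() as manager:
        # corpus is a proxy; method calls on it are forwarded to the server process.
        corpus = manager.PoolCorpus()

        with Pool(processes=2) as pool:
            # Each worker calls add() through the proxy, so all docs land in one corpus.
            pool.map(corpus.add, ((v, {'key': k}) for k, v in texts.items()))

        # get() pickles the whole corpus and sends it back to this process.
        print(corpus.get())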

Output:

Corpus(4 docs, 16 tokens)
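For the larger dataset, where returning the whole corpus in one get() call runs into the message size limit, one possible workaround is to pull the data back in slices so that each pickled reply stays small. This is a minimal sketch, not a tested textacy solution: ChunkedStore, count, get_chunk and fetch_all are hypothetical names, and a plain list of strings stands in for the parsed documents.

```python
class ChunkedStore:
    """Hypothetical stand-in for PoolCorpus that hands items back in slices."""

    def __init__(self):
        self._items = []

    def add(self, item):
        self._items.append(item)

    def count(self):
        return len(self._items)

    def get_chunk(self, start, stop):
        # Only this slice is returned, so only this slice would be pickled.
        return self._items[start:stop]


def fetch_all(store, chunk_size):
    """Collect every item from the store, at most chunk_size items per call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    collected = []
    for start in range(0, store.count(), chunk_size):
        collected.extend(store.get_chunk(start, start + chunk_size))
    return collected


texts = ["First text 1.", "Second text 2.", "Third text 3.", "Fourth text 4."]
store = ChunkedStore()
for text in texts:
    store.add(text)

print(fetch_all(store, chunk_size=3))
```

Here the store is used directly. Registered with BaseManager.register in the same way as PoolCorpus, the same three public methods should be callable through the manager proxy, so fetch_all would receive the proxy instead. Whether textacy documents pickle cleanly in such slices is an assumption that would need checking.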
