
Correct multithreaded lemmatization using spaCy

I'm trying to multithread the lemmatization of my corpus using spaCy. Following the documentation, this is currently my approach:

import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'tagger'])

def lemmatize():
    for doc in nlp.pipe(corpus, batch_size=2, n_threads=10):
        yield ' '.join([token.lemma_ for token in doc])

new_corpus = list(lemmatize())

However, this takes the same amount of time whether I use 10 threads or 1 (on 100,000 documents), suggesting that it is not multithreaded.

Is my implementation wrong?

The n_threads argument has been deprecated in newer versions of spaCy and no longer does anything. See the note here: https://spacy.io/api/language#pipe

Here's their example code for doing this with multi-processing instead:

https://spacy.io/usage/examples#multi-processing
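As a minimal sketch of the multiprocessing approach: in spaCy v2.2.2+, `nlp.pipe` accepts an `n_process` argument that forks worker processes. The example below uses `spacy.blank('en')` (a tokenizer-only pipeline, so it joins `token.text` rather than lemmas) purely so it runs without a downloaded model; with `spacy.load('en_core_web_sm', ...)` as in the question, you would join `token.lemma_` instead.

```python
import spacy

# Blank pipeline for illustration only; swap in
# spacy.load('en_core_web_sm', disable=['parser', 'ner']) for real lemmas.
nlp = spacy.blank('en')

corpus = ['The cats are running', 'Dogs were barking loudly']

def lemmatize(texts):
    # n_process spawns worker processes (replaces the old n_threads);
    # batch_size controls how many docs each worker receives per batch.
    for doc in nlp.pipe(texts, batch_size=50, n_process=2):
        yield ' '.join(token.text for token in doc)

new_corpus = list(lemmatize(corpus))
print(new_corpus)
```

Note that multiprocessing has per-process startup overhead, so on small corpora it can be slower than a single process; it pays off mainly on large batches of documents.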
