
Correct multithreaded lemmatization using spaCy

I'm trying to multithread the lemmatization of my corpus using spaCy. Following the documentation, this is currently my approach:

import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'tagger'])

def lemmatize():
    for doc in nlp.pipe(corpus, batch_size=2, n_threads=10):
        yield ' '.join([token.lemma_ for token in doc])

new_corpus = list(lemmatize())

However, this takes the same amount of time whether I use 10 threads or 1 (on 100,000 documents), suggesting that it is not multithreaded.

Is my implementation wrong?

The n_threads argument has been deprecated in newer versions of spaCy and no longer does anything. See the note here: https://spacy.io/api/language#pipe

Here's their example code for doing this with multi-processing instead:

https://spacy.io/usage/examples#multi-processing
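As a minimal sketch of the multiprocessing approach: in spaCy v2.2.2+, `nlp.pipe` accepts an `n_process` argument that forks worker processes. The example below uses `spacy.blank('en')` (a tokenizer-only pipeline, so it joins `token.text` rather than lemmas) purely so it runs without a downloaded model; with `spacy.load('en_core_web_sm', ...)` as in the question, you would join `token.lemma_` instead.

```python
import spacy

# Blank pipeline for illustration only; swap in
# spacy.load('en_core_web_sm', disable=['parser', 'ner']) for real lemmas.
nlp = spacy.blank('en')

corpus = ['The cats are running', 'Dogs were barking loudly']

def lemmatize(texts):
    # n_process spawns worker processes (replaces the old n_threads);
    # batch_size controls how many docs each worker receives per batch.
    for doc in nlp.pipe(texts, batch_size=50, n_process=2):
        yield ' '.join(token.text for token in doc)

new_corpus = list(lemmatize(corpus))
print(new_corpus)
```

Note that multiprocessing has per-process startup overhead, so on small corpora it can be slower than a single process; it pays off mainly on large batches of documents.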
