
How to speed up lemmatization using spacy.pipe on text?

How can I speed up lemmatization on a set of texts using spaCy's pipe? Currently I'm using it like this:

import spacy

nlp = spacy.load('en_core_web_lg')
df['text'].apply(lambda x: len(nlp(x).ents))  # returns the number of named entities per text

How can I extract the number of named entities using nlp.pipe with batch_size, multiple processes, etc., and take advantage of multiprocessing?

spacy_nlp.pipe(df['text'], n_threads=6, batch_size=10)

I don't think pipe has an n_threads parameter, but spaCy 2.2.2 does provide an n_process parameter that indicates the "...Number of processors to use". So you could at least distribute the load over your CPUs.
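As a minimal sketch (the DataFrame contents, worker count of 4, and batch size of 50 here are assumptions, not from the original post), counting entities with nlp.pipe could look like this:

import spacy
import pandas as pd

nlp = spacy.load('en_core_web_lg')

# Hypothetical data standing in for the question's DataFrame.
df = pd.DataFrame({'text': ["Apple is looking at buying a U.K. startup.",
                            "London is a big city in the United Kingdom."]})

# nlp.pipe streams the texts in batches; n_process (spaCy >= 2.2.2) spreads
# the work over worker processes, batch_size controls how many texts go in
# each batch.
df['n_ents'] = [len(doc.ents)
                for doc in nlp.pipe(df['text'], n_process=4, batch_size=50)]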

I'm honestly not sure how much you're going to be able to parallelize the processing of a single text. SpaCy is pretty damn fast already.

If you have a bunch of texts, or extremely large texts that can be broken into sections, then you may benefit from distributing the load using something like the ProcessPoolExecutor, which spreads processing over a number of processes (see the sketch below).
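A rough sketch of that approach, assuming each worker loads its own model and the chunking, worker count, and placeholder texts are illustrative choices rather than anything from the original post:

from concurrent.futures import ProcessPoolExecutor

import spacy

def count_entities(texts):
    # Each worker process loads its own copy of the model; that costs some
    # start-up time but avoids trying to share the nlp object across processes.
    nlp = spacy.load('en_core_web_lg')
    return [len(doc.ents) for doc in nlp.pipe(texts)]

def chunked(items, n_chunks):
    # Split the input into roughly equal slices, one per worker.
    size = max(1, len(items) // n_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]

if __name__ == '__main__':
    texts = ["Apple is looking at buying a U.K. startup."] * 1000  # placeholder data
    with ProcessPoolExecutor(max_workers=4) as executor:
        counts = [c
                  for chunk_counts in executor.map(count_entities, chunked(texts, 4))
                  for c in chunk_counts]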

Unfortunately, because of Python's notorious Global Interpreter Lock, you're going to struggle to squeeze performance out via thread parallelization unless the processing is happening in a C library behind the scenes (as with NumPy and TensorFlow).
