
How to speed up lemmatization using spacy.pipe on text?

How can I speed up lemmatization on a set of texts using spaCy's pipe? Currently I'm using it like this:

import spacy

nlp = spacy.load('en_core_web_lg')
df['text'].apply(lambda x: len(nlp(x).ents))  # returns the number of named entities per text

How can I extract the number of named entities using nlp.pipe with batch_size, multiple processes, etc., and take advantage of multiprocessing?

spacy_nlp.pipe(df['text'], n_threads=6, batch_size=10)

I don't think pipe has an n_threads parameter, but spaCy 2.2.2 does provide an n_process parameter that indicates the "...Number of processors to use". So you could at least distribute the load over your CPUs.
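As a minimal sketch (the DataFrame contents, worker count of 4, and batch size of 50 here are assumptions, not from the original post), counting entities with nlp.pipe could look like this:

import spacy
import pandas as pd

nlp = spacy.load('en_core_web_lg')

# Hypothetical data standing in for the question's DataFrame.
df = pd.DataFrame({'text': ["Apple is looking at buying a U.K. startup.",
                            "London is a big city in the United Kingdom."]})

# nlp.pipe streams the texts in batches; n_process (spaCy >= 2.2.2) spreads
# the work over worker processes, batch_size controls how many texts go in
# each batch.
df['n_ents'] = [len(doc.ents)
                for doc in nlp.pipe(df['text'], n_process=4, batch_size=50)]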

I'm honestly not sure how much you're going to be able to parallelize the processing of a single text. SpaCy is pretty damn fast already.

If you have a bunch of texts, or extremely large texts that can be broken into sections, then you may benefit from distributing the load using something like the ProcessPoolExecutor, which spreads processing over a number of processes (see the sketch below).
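A rough sketch of that approach, assuming each worker loads its own model and the chunking, worker count, and placeholder texts are illustrative choices rather than anything from the original post:

from concurrent.futures import ProcessPoolExecutor

import spacy

def count_entities(texts):
    # Each worker process loads its own copy of the model; that costs some
    # start-up time but avoids trying to share the nlp object across processes.
    nlp = spacy.load('en_core_web_lg')
    return [len(doc.ents) for doc in nlp.pipe(texts)]

def chunked(items, n_chunks):
    # Split the input into roughly equal slices, one per worker.
    size = max(1, len(items) // n_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]

if __name__ == '__main__':
    texts = ["Apple is looking at buying a U.K. startup."] * 1000  # placeholder data
    with ProcessPoolExecutor(max_workers=4) as executor:
        counts = [c
                  for chunk_counts in executor.map(count_entities, chunked(texts, 4))
                  for c in chunk_counts]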

Unfortunately, because of Python's notorious Global Interpreter Lock, you're going to struggle to squeeze performance out via thread parallelization unless the processing is happening in a C library behind the scenes (as with NumPy and TensorFlow).
