I'm trying to process data coming from BigQuery.
I created a pipeline with Apache Beam as below:
```python
import string

import apache_beam as beam
import fr_core_news_lg
import unidecode

nlp = fr_core_news_lg.load()

class CleanText(beam.DoFn):
    def process(self, row):
        # Strip accents, lowercase, replace punctuation with spaces,
        # and collapse whitespace.
        row['descriptioncleaned'] = ' '.join(
            unidecode.unidecode(str(row['description']))
            .lower()
            .translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
            .split()
        )
        yield row

class LemmaText(beam.DoFn):
    def process(self, row):
        doc = nlp(row['descriptioncleaned'],
                  disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])
        row['descriptionlemmatized'] = ' '.join(set(token.lemma_ for token in doc))
        yield row

with beam.Pipeline(runner="direct", options=options) as pipeline:
    soft = (
        pipeline
        | "GetRows" >> beam.io.ReadFromBigQuery(table=table_spec, gcs_location="gs://mygs")
        | "CleanText" >> beam.ParDo(CleanText())
        | "LemmaText" >> beam.ParDo(LemmaText())
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            'mybq',
            custom_gcs_temp_location="gs://mygs",
            create_disposition="CREATE_IF_NEEDED",
            write_disposition="WRITE_TRUNCATE",
        )
    )
```
Basically, it loads data from my BigQuery table, cleans one of the columns (of type string), and lemmatizes it using the spaCy lemmatizer. I have approx. 8M rows, and each string is fairly long, around 300 words.

In the end it takes more than 15 hours to complete, and we have to run it every day.

I don't really understand why it takes so long on Dataflow, which is supposed to run in a parallelized way. I already tried `nlp.pipe` from spaCy, but I can't get it to work with Apache Beam.

Is there a way to speed up the spaCy processing on Dataflow, or to parallelize it better?
I don't know anything about Dataflow, but some observations about your spaCy usage...
You're using the lemmatizer without the tagger. That's comparatively fast but low quality, because good lemmas rely on part-of-speech tags. If you're going to do that, you should use `nlp.blank("fr")` and add the lemmatizer to it; otherwise the `tok2vec` encoding layer will still run despite not being used, and that's probably slower than the lemmatizer itself. (It's also not clear to me how often the `spacy.load` call is executed, but the call to `blank` is much faster.)
I'm not sure how you'd map it to Dataflow, but you want to use `nlp.pipe` if possible for speed. On the other hand, if you're not using `tok2vec` or any statistical components, it probably won't make much difference. Also see the speed FAQ.