
How to import Spacy to run with GCP Dataflow?

I would like to run spaCy lemmatization on a column within a ParDo on GCP Dataflow.

My Dataflow project is composed of 3 files: main.py, which contains the pipeline script; myfile.json, which contains the service account key; and setup.py, which declares the project's dependencies:

main.py

import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions
import unidecode
import string
import spacy
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "myfile.json"

table_spec = bigquery.TableReference(
    projectId='scrappers-293910',
    datasetId='mydataset',
    tableId='mytable')

options = PipelineOptions(
  job_name="lemmatize-job-offers-description-2",
  project="myproject",
  region="europe-west6",
  temp_location="gs://mygcp/options/temp_location/",
  staging_location="gs://mygcp/options/staging_location/")

nlp = spacy.load("fr_core_news_sm", disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])

class CleanText(beam.DoFn):
  def process(self, row):
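    # Strip accents, lowercase the text, replace punctuation with spaces,
    # and collapse repeated whitespace.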
    row['descriptioncleaned'] = ' '.join(unidecode.unidecode(str(row['description'])).lower().translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))).split())
    yield row

class LemmaText(beam.DoFn):
  def process(self, row):
    doc = nlp(row['descriptioncleaned'])
    row['descriptionlemmatized'] = ' '.join(list(set([token.lemma_ for token in doc])))
    yield row

with beam.Pipeline(runner="DataflowRunner", options=options) as pipeline:
  soft = pipeline \
  | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(table=table_spec, gcs_location="gs://mygcp/gcs_location") \
  | "CleanText" >> beam.ParDo(CleanText()) \
  | "LemmaText" >> beam.ParDo(LemmaText()) \
  | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('mybq.path', custom_gcs_temp_location="gs://mygcp/gcs_temp_location", create_disposition="CREATE_IF_NEEDED", write_disposition="WRITE_TRUNCATE")

setup.py

import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    install_requires=['spacy', 'unidecode', 'fr_core_news_lg @ git+https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.2.0/fr_core_news_lg-3.2.0.tar.gz'],
    packages=setuptools.find_packages()
)

and I submit the job to Dataflow with the following command:

python3 main.py --setup_file ./setup.py

Locally it works fine, but as soon as I submit it to Dataflow, the job fails after a few minutes with an error.

I searched for the cause, and it seems to be related to the module dependencies.

Is it alright to import the Spacy model like I did? What am I doing wrong?

See https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/. It seems that you can use a requirements file with the requirements_file pipeline option.
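For example, a minimal sketch (the requirements.txt file and its exact contents are assumptions here, with the model URL reused from the setup.py above): list the worker dependencies in a requirements file next to main.py and reference it from the pipeline options, so Dataflow stages those packages and installs them on every worker:

# requirements.txt (hypothetical), placed next to main.py:
#   unidecode
#   spacy
#   https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.2.0/fr_core_news_lg-3.2.0.tar.gz

options = PipelineOptions(
  job_name="lemmatize-job-offers-description-2",
  project="myproject",
  region="europe-west6",
  temp_location="gs://mygcp/options/temp_location/",
  staging_location="gs://mygcp/options/staging_location/",
  # Dataflow downloads these packages, stages them, and installs them on each worker.
  requirements_file="requirements.txt")

The equivalent command-line form is passing --requirements_file requirements.txt when launching main.py.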

Additionally, if you run into a NameError, see https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors .
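Related to that FAQ entry: loading the spaCy model at module level in main.py means nlp is created in the launcher process and has to be shipped to the workers, which is a common source of NameError and pickling problems. One common pattern (a sketch, not taken from the original post) is to load the model once per worker in DoFn.setup() instead:

class LemmaText(beam.DoFn):
  def setup(self):
    # setup() runs once per worker, after the dependencies declared in
    # setup.py / requirements.txt have been installed, so the model is
    # loaded on the worker itself rather than pickled from the launcher.
    import spacy
    self.nlp = spacy.load(
        "fr_core_news_sm",
        disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])

  def process(self, row):
    doc = self.nlp(row['descriptioncleaned'])
    row['descriptionlemmatized'] = ' '.join(set(token.lemma_ for token in doc))
    yield row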
