
How to import Spacy to run with GCP Dataflow?

I would like to run Spacy lemmatization on a column within a ParDo on GCP DataFlow.

My DataFlow project is composed of 3 files: main.py, which contains the script; myfile.json, which contains the service account key; and setup.py, which contains the requirements for the project:

main.py

import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions
import unidecode
import string
import spacy
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "myfile.json"

table_spec = bigquery.TableReference(
    projectId='scrappers-293910',
    datasetId='mydataset',
    tableId='mytable')

options = PipelineOptions(
  job_name="lemmatize-job-offers-description-2",
  project="myproject",
  region="europe-west6",
  temp_location="gs://mygcp/options/temp_location/",
  staging_location="gs://mygcp/options/staging_location/")

nlp = spacy.load("fr_core_news_sm", disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])

class CleanText(beam.DoFn):
  def process(self, row):
    row['descriptioncleaned'] = ' '.join(unidecode.unidecode(str(row['description'])).lower().translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))).split())
    yield row

class LemmaText(beam.DoFn):
  def process(self, row):
    doc = nlp(row['descriptioncleaned'])
    row['descriptionlemmatized'] = ' '.join(list(set([token.lemma_ for token in doc])))
    yield row

with beam.Pipeline(runner="DataflowRunner", options=options) as pipeline:
  soft = pipeline \
  | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(table=table_spec, gcs_location="gs://mygcp/gcs_location") \
  | "CleanText" >> beam.ParDo(CleanText()) \
  | "LemmaText" >> beam.ParDo(LemmaText()) \
  | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('mybq.path', custom_gcs_temp_location="gs://mygcp/gcs_temp_location", create_disposition="CREATE_IF_NEEDED", write_disposition="WRITE_TRUNCATE")

setup.py

import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    install_requires=['spacy', 'unidecode', 'fr_core_news_lg @ git+https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.2.0/fr_core_news_lg-3.2.0.tar.gz'],
    packages=setuptools.find_packages()
)

and I send the job to DataFlow with the following command:

python3 main.py --setup_file ./setup.py

Locally it works fine, but as soon as I send it to DataFlow, after a few minutes I get:

[error screenshot]

I searched for the reason and it seems to be related to the module dependencies.

Is it alright to import the Spacy model like I did? What am I doing wrong?

See https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/. It seems that you can use a requirements file with the requirements_file pipeline option.
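
A minimal sketch of that approach, assuming a requirements.txt next to main.py that lists the pip-installable dependencies (the file name and option values here are assumptions, not from the original post):

# requirements.txt (assumed contents):
#   spacy
#   unidecode

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
  job_name="lemmatize-job-offers-description-2",
  project="myproject",
  region="europe-west6",
  temp_location="gs://mygcp/options/temp_location/",
  staging_location="gs://mygcp/options/staging_location/",
  # Workers pip-install everything listed in this file at startup.
  requirements_file="requirements.txt")

The same flag can also be passed on the command line, e.g. python3 main.py --setup_file ./setup.py --requirements_file ./requirements.txt. Note that a requirements file only covers packages installable from pip; a model pulled from a GitHub release URL would still need setup.py or a direct URL entry.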

Additionally, if you run into a NameError, see https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors.
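
If it does turn out to be a NameError on the workers, one common pattern is to load the spaCy model inside the DoFn rather than at module level, so each worker loads it once after the dependencies have been installed. A hedged sketch that restructures the LemmaText class from the question (an illustration, not the poster's exact code):

import apache_beam as beam

class LemmaText(beam.DoFn):
  def setup(self):
    # Runs once per worker, after setup.py / requirements_file have
    # installed spacy and the French model on the worker.
    import spacy
    self.nlp = spacy.load(
        "fr_core_news_sm",
        disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])

  def process(self, row):
    doc = self.nlp(row['descriptioncleaned'])
    row['descriptionlemmatized'] = ' '.join(set(token.lemma_ for token in doc))
    yield row

Alternatively, the linked FAQ discusses the save_main_session pipeline option, which ships module-level definitions from the main session to the workers.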
