
How to import Spacy to run with GCP Dataflow?

I want to run Spacy lemmatization on a column inside a ParDo on GCP Dataflow.

My Dataflow project consists of 3 files: main.py contains the script, myfile.json contains the service account key, and setup.py contains the project requirements:

main.py

import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions
import unidecode
import string
import spacy
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "myfile.json"

table_spec = bigquery.TableReference(
    projectId='scrappers-293910',
    datasetId='mydataset',
    tableId='mytable')

options = PipelineOptions(
  job_name="lemmatize-job-offers-description-2",
  project="myproject",
  region="europe-west6",
  temp_location="gs://mygcp/options/temp_location/",
  staging_location="gs://mygcp/options/staging_location/")

nlp = spacy.load("fr_core_news_sm", disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])

class CleanText(beam.DoFn):
  def process(self, row):
    row['descriptioncleaned'] = ' '.join(unidecode.unidecode(str(row['description'])).lower().translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))).split())
    yield row

class LemmaText(beam.DoFn):
  def process(self, row):
    doc = nlp(row['descriptioncleaned'])
    row['descriptionlemmatized'] = ' '.join(list(set([token.lemma_ for token in doc])))
    yield row

with beam.Pipeline(runner="DataflowRunner", options=options) as pipeline:
  soft = pipeline \
  | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(table=table_spec, gcs_location="gs://mygcp/gcs_location") \
  | "CleanText" >> beam.ParDo(CleanText()) \
  | "LemmaText" >> beam.ParDo(LemmaText()) \
  | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('mybq.path', custom_gcs_temp_location="gs://mygcp/gcs_temp_location", create_disposition="CREATE_IF_NEEDED", write_disposition="WRITE_TRUNCATE")

setup.py

import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    install_requires=['spacy', 'unidecode', 'fr_core_news_lg @ git+https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.2.0/fr_core_news_lg-3.2.0.tar.gz'],
    packages=setuptools.find_packages()
)

I send the job to Dataflow with the following command:

python3 main.py --setup_file ./setup.py

Locally it works fine, but as soon as I send it to Dataflow, after a few minutes I get:

[error screenshot]

I searched for the cause and it seems to be a module dependency issue.

Is it OK to import the Spacy model the way I did? What am I doing wrong?

From https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ it looks like you can use a requirements file with the requirements_file pipeline option.
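For example, a minimal sketch (the file name requirements.txt, its contents, and staging it next to main.py are assumptions based on the setup.py above, not from the original question):

# requirements.txt, placed in the same directory as main.py
spacy
unidecode
# the model can be listed as a direct download URL, mirroring setup.py
https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.2.0/fr_core_news_lg-3.2.0.tar.gz

and then passed on the command line instead of (or alongside) --setup_file:

python3 main.py --requirements_file ./requirements.txt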

Also, if you run into name errors, see https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
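The usual fix from that FAQ is to ship the main session to the workers, so that module-level globals (such as nlp above) also exist on the remote side. A minimal sketch, assuming the same PipelineOptions as in main.py:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
  job_name="lemmatize-job-offers-description-2",
  project="myproject",
  region="europe-west6",
  temp_location="gs://mygcp/options/temp_location/",
  staging_location="gs://mygcp/options/staging_location/")
# Pickle the __main__ namespace (imports and globals like nlp) and send it to the workers
options.view_as(SetupOptions).save_main_session = True

Equivalently, --save_main_session can be passed on the command line.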
