How to import spaCy to run with GCP Dataflow?
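I want to run spaCy lemmatization on a column inside a ParDo on GCP Dataflow.
My Dataflow project consists of 3 files: main.py, which contains the script; myfile.json, which contains the service account key; and setup.py, which lists the project requirements: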
main.py
import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions
import unidecode
import string
import spacy
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "myfile.json"

table_spec = bigquery.TableReference(
    projectId='scrappers-293910',
    datasetId='mydataset',
    tableId='mytable')

options = PipelineOptions(
    job_name="lemmatize-job-offers-description-2",
    project="myproject",
    region="europe-west6",
    temp_location="gs://mygcp/options/temp_location/",
    staging_location="gs://mygcp/options/staging_location/")

nlp = spacy.load("fr_core_news_sm", disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])

class CleanText(beam.DoFn):
    def process(self, row):
        row['descriptioncleaned'] = ' '.join(unidecode.unidecode(str(row['description'])).lower().translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))).split())
        yield row

class LemmaText(beam.DoFn):
    def process(self, row):
        doc = nlp(row['descriptioncleaned'])
        row['descriptionlemmatized'] = ' '.join(list(set([token.lemma_ for token in doc])))
        yield row

with beam.Pipeline(runner="DataflowRunner", options=options) as pipeline:
    soft = pipeline \
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(table=table_spec, gcs_location="gs://mygcp/gcs_location") \
        | "CleanText" >> beam.ParDo(CleanText()) \
        | "LemmaText" >> beam.ParDo(LemmaText()) \
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('mybq.path', custom_gcs_temp_location="gs://mygcp/gcs_temp_location", create_disposition="CREATE_IF_NEEDED", write_disposition="WRITE_TRUNCATE")

setup.py
import setuptools
setuptools.setup(
    name='PACKAGE-NAME',
    install_requires=['spacy', 'unidecode', 'fr_core_news_lg @ git+https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.2.0/fr_core_news_lg-3.2.0.tar.gz'],
    packages=setuptools.find_packages()
)
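
One thing that may be worth checking in the install_requires entry above: the `git+` prefix asks pip to clone a git repository, whereas the URL points at a release archive (.tar.gz). Below is a minimal sketch of how a plain direct-URL requirement for the same archive could be built. The helper name spacy_model_requirement is hypothetical, and the URL pattern is simply copied from the setup.py above with `git+` dropped; whether this alone fixes the Dataflow job is an assumption, not something confirmed here.

```python
def spacy_model_requirement(name: str, version: str) -> str:
    """Build a 'name @ url' requirement string for a spaCy model archive.

    Hypothetical helper: the URL layout mirrors the one used in setup.py,
    without the 'git+' prefix, since the target is a tarball, not a repo.
    """
    archive = f"{name}-{version}"
    url = (
        "https://github.com/explosion/spacy-models/releases/download/"
        f"{archive}/{archive}.tar.gz"
    )
    return f"{name} @ {url}"


# Same model and version as in the setup.py above.
requirement = spacy_model_requirement("fr_core_news_lg", "3.2.0")
print(requirement)
```

The resulting string could then replace the third entry of install_requires, with the model version chosen to match the installed spaCy version.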
I submit the job to Dataflow with the following command:
python3 main.py --setup_file ./setup.py
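Locally it works fine, but once I submit it to Dataflow, after a few minutes I get an error.
I searched for the cause, and it seems to be a module dependency problem.
Is it possible to import a spaCy model the way I did? What am I doing wrong?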
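
A possible starting point for debugging: main.py loads "fr_core_news_sm", while setup.py declares "fr_core_news_lg", so the model loaded on the workers may not be the one that gets installed there. The sketch below is a small standard-library check that compares a model name against the names declared in install_requires. The function names requirement_names and model_is_declared are hypothetical, and the name parsing is deliberately simplified rather than a full requirement parser.

```python
import re

# Leading distribution name of a requirement string (simplified pattern).
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def _normalize(name: str) -> str:
    # Treat '-', '_' and '.' as equivalent and ignore case.
    return re.sub(r"[-_.]+", "_", name).lower()


def requirement_names(install_requires):
    """Return the normalized distribution names found in install_requires."""
    names = set()
    for req in install_requires:
        match = _NAME.match(req)
        if match:
            names.add(_normalize(match.group(1)))
    return names


def model_is_declared(model_name, install_requires):
    """True if the model passed to spacy.load() is listed as a requirement."""
    return _normalize(model_name) in requirement_names(install_requires)


# Entries copied from the setup.py in the question.
install_requires = [
    "spacy",
    "unidecode",
    "fr_core_news_lg @ git+https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.2.0/fr_core_news_lg-3.2.0.tar.gz",
]

print(model_is_declared("fr_core_news_sm", install_requires))  # False
print(model_is_declared("fr_core_news_lg", install_requires))  # True
```

If the check reports False for the model name used in main.py, aligning the two names (either one) would be a reasonable first thing to try.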