How to import spaCy to run with GCP Dataflow?
I would like to run spaCy lemmatization on a column within a ParDo on GCP Dataflow.
My Dataflow project is composed of 3 files: main.py, which contains the script; myfile.json, which contains the service account key; and setup.py, which contains the requirements for the project:
main.py
import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions
import unidecode
import string
import spacy
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "myfile.json"

table_spec = bigquery.TableReference(
    projectId='scrappers-293910',
    datasetId='mydataset',
    tableId='mytable')

options = PipelineOptions(
    job_name="lemmatize-job-offers-description-2",
    project="myproject",
    region="europe-west6",
    temp_location="gs://mygcp/options/temp_location/",
    staging_location="gs://mygcp/options/staging_location/")

nlp = spacy.load("fr_core_news_sm", disable=["tagger", "parser", "attribute_ruler", "ner", "textcat"])

class CleanText(beam.DoFn):
    def process(self, row):
        row['descriptioncleaned'] = ' '.join(unidecode.unidecode(str(row['description'])).lower().translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))).split())
        yield row

class LemmaText(beam.DoFn):
    def process(self, row):
        doc = nlp(row['descriptioncleaned'])
        row['descriptionlemmatized'] = ' '.join(list(set([token.lemma_ for token in doc])))
        yield row

with beam.Pipeline(runner="DataflowRunner", options=options) as pipeline:
    soft = pipeline \
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(table=table_spec, gcs_location="gs://mygcp/gcs_location") \
        | "CleanText" >> beam.ParDo(CleanText()) \
        | "LemmaText" >> beam.ParDo(LemmaText()) \
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('mybq.path', custom_gcs_temp_location="gs://mygcp/gcs_temp_location", create_disposition="CREATE_IF_NEEDED", write_disposition="WRITE_TRUNCATE")
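As a sanity check, the cleaning step in CleanText can be exercised locally without Beam; here is a minimal stdlib-only sketch of the same logic (the unidecode transliteration is dropped, so accents are kept):

```python
import string

def clean(text):
    # Lowercase, replace every punctuation character with a space,
    # then collapse runs of whitespace -- mirrors the CleanText DoFn.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(str(text).lower().translate(table).split())

print(clean("Développeur Python/Beam, H/F !"))  # prints "développeur python beam h f"
```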
setup.py
import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    install_requires=['spacy', 'unidecode', 'fr_core_news_lg @ git+https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.2.0/fr_core_news_lg-3.2.0.tar.gz'],
    packages=setuptools.find_packages()
)
and I send the job to Dataflow with the following command:
python3 main.py --setup_file ./setup.py
Locally it works fine, but as soon as I send it to Dataflow, after a few minutes I get an error. I searched for the reason and it seems to be the module dependencies.
Is it alright to import the spaCy model like I did? What am I doing wrong?
See https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ : you can use a requirements file with the requirements_file pipeline option.
Additionally, if you run into a NameError, see https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors .
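A minimal sketch of that approach, assuming the plain pip dependencies move into a requirements.txt while the model download URL stays in setup.py (the file contents below are an assumption, not the asker's exact setup). Note also that setup.py installs fr_core_news_lg while main.py loads fr_core_news_sm; only the packaged model will exist on the Dataflow workers, so the two names need to match:

```shell
# requirements.txt (assumed contents):
#   spacy
#   unidecode

python3 main.py \
  --setup_file ./setup.py \
  --requirements_file ./requirements.txt
```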