
Passing requirements.txt to Google Cloud Pyspark Batch Job

I am trying to run a pyspark script through a Google Dataproc Batch Job.

My script should connect to firestore to collect some data from there, so I need to access the library firebase-admin. When I run the script on Google Cloud through the following command:

gcloud dataproc batches submit \
        --project {PROJECT} \
        --region europe-west1 \
        --subnet {SUBNET} \
        pyspark spark_image_matching/main.py \
        --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
        --deps-bucket={DEPS_BUCKET} 

I receive the following error:

Traceback (most recent call last):
  File "/tmp/srvls-batch-0127aaf6-a438-4439-af56-beb1a66f45ed/main.py", line 4, in <module>
    import firebase_admin
ModuleNotFoundError: No module named 'firebase_admin'

I already tried creating a setup.py file to generate an .egg file that specifies the dependency, passed along with the --py-files flag (rough sketch below). This idea was heavily inspired by this repo:

https://github.com/GoogleCloudPlatform/dataproc-templates/blob/main/python/setup.py
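
For reference, the attempt looked roughly like this (the egg file name is illustrative; the other flags are the ones from the submit command above):

# Build the egg declared by setup.py (the output name below is illustrative)
python setup.py bdist_egg

# Submit, shipping the egg with --py-files
gcloud dataproc batches submit \
        --project {PROJECT} \
        --region europe-west1 \
        --subnet {SUBNET} \
        pyspark spark_image_matching/main.py \
        --deps-bucket={DEPS_BUCKET} \
        --py-files=dist/spark_image_matching-0.1-py3.9.egg

Note that --py-files only places the egg on the executors' PYTHONPATH; it does not pip-install the install_requires declared in setup.py, which is why firebase_admin is still missing at runtime.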

To customize the Dataproc Serverless for Spark execution environment, it is recommended to use custom container images: https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers
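
A rough outline of that route, assuming an Artifact Registry image whose Dockerfile pip-installs firebase-admin and otherwise follows the requirements in the guide above (the image path and tag are placeholders):

# Build and push an image that already contains firebase-admin
# (the Dockerfile must follow the custom-container guide; per that guide,
#  Spark itself is mounted into the container by Dataproc Serverless at runtime)
docker build -t europe-west1-docker.pkg.dev/{PROJECT}/spark-images/image-matching:1.0 .
docker push europe-west1-docker.pkg.dev/{PROJECT}/spark-images/image-matching:1.0

# Submit the batch against the custom image
gcloud dataproc batches submit \
        --project {PROJECT} \
        --region europe-west1 \
        --subnet {SUBNET} \
        --container-image=europe-west1-docker.pkg.dev/{PROJECT}/spark-images/image-matching:1.0 \
        pyspark spark_image_matching/main.py \
        --deps-bucket={DEPS_BUCKET}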

As an alternative, you can take a look at Spark-supported ways of managing Python dependencies: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
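
For example, the venv-based flow from that guide could look roughly like this for a batch submission (the #environment alias and the spark.pyspark.python property come from the Spark packaging guide; whether they pass through gcloud unchanged is an assumption worth verifying):

# Pack a virtual environment that contains firebase-admin
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install firebase-admin venv-pack
venv-pack -o pyspark_venv.tar.gz

# Ship the archive and point the Python workers at its interpreter
gcloud dataproc batches submit \
        --project {PROJECT} \
        --region europe-west1 \
        --subnet {SUBNET} \
        --archives=pyspark_venv.tar.gz#environment \
        --properties=spark.pyspark.python=./environment/bin/python \
        pyspark spark_image_matching/main.py \
        --deps-bucket={DEPS_BUCKET}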
