
Passing requirements.txt to Google Cloud Pyspark Batch Job

I am trying to run a pyspark script through a Google Dataproc Batch Job.

My script should connect to firestore to collect some data from there, so I need to access the library firebase-admin. When I run the script on Google Cloud through the following command:

gcloud dataproc batches submit \
        --project {PROJECT} \
        --region europe-west1 \
        --subnet {SUBNET} \
        pyspark spark_image_matching/main.py \
        --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
        --deps-bucket={DEPS_BUCKET} 

I receive the following error:

Traceback (most recent call last):
  File "/tmp/srvls-batch-0127aaf6-a438-4439-af56-beb1a66f45ed/main.py", line 4, in <module>
    import firebase_admin
ModuleNotFoundError: No module named 'firebase_admin'

I already tried creating a setup.py file to generate an .egg file that specifies the dependency, passed along with the --py-files flag (rough sketch below). This idea was heavily inspired by this repo:

https://github.com/GoogleCloudPlatform/dataproc-templates/blob/main/python/setup.py
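
For reference, the attempt looked roughly like this (the egg file name is illustrative; the other flags are the ones from the submit command above):

# Build the egg declared by setup.py (the output name below is illustrative)
python setup.py bdist_egg

# Submit, shipping the egg with --py-files
gcloud dataproc batches submit \
        --project {PROJECT} \
        --region europe-west1 \
        --subnet {SUBNET} \
        pyspark spark_image_matching/main.py \
        --deps-bucket={DEPS_BUCKET} \
        --py-files=dist/spark_image_matching-0.1-py3.9.egg

Note that --py-files only places the egg on the executors' PYTHONPATH; it does not pip-install the install_requires declared in setup.py, which is why firebase_admin is still missing at runtime.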

To customize the Dataproc Serverless for Spark execution environment, it is recommended to use custom container images: https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers
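
A rough outline of that route, assuming an Artifact Registry image whose Dockerfile pip-installs firebase-admin and otherwise follows the requirements in the guide above (the image path and tag are placeholders):

# Build and push an image that already contains firebase-admin
# (the Dockerfile must follow the custom-container guide; per that guide,
#  Spark itself is mounted into the container by Dataproc Serverless at runtime)
docker build -t europe-west1-docker.pkg.dev/{PROJECT}/spark-images/image-matching:1.0 .
docker push europe-west1-docker.pkg.dev/{PROJECT}/spark-images/image-matching:1.0

# Submit the batch against the custom image
gcloud dataproc batches submit \
        --project {PROJECT} \
        --region europe-west1 \
        --subnet {SUBNET} \
        --container-image=europe-west1-docker.pkg.dev/{PROJECT}/spark-images/image-matching:1.0 \
        pyspark spark_image_matching/main.py \
        --deps-bucket={DEPS_BUCKET}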

As an alternative, you can take a look at Spark-supported ways of managing Python dependencies: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
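
For example, the venv-based flow from that guide could look roughly like this for a batch submission (the #environment alias and the spark.pyspark.python property come from the Spark packaging guide; whether they pass through gcloud unchanged is an assumption worth verifying):

# Pack a virtual environment that contains firebase-admin
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install firebase-admin venv-pack
venv-pack -o pyspark_venv.tar.gz

# Ship the archive and point the Python workers at its interpreter
gcloud dataproc batches submit \
        --project {PROJECT} \
        --region europe-west1 \
        --subnet {SUBNET} \
        --archives=pyspark_venv.tar.gz#environment \
        --properties=spark.pyspark.python=./environment/bin/python \
        pyspark spark_image_matching/main.py \
        --deps-bucket={DEPS_BUCKET}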
