Installing Python packages in Serverless Dataproc GCP
I wanted to install some Python packages (e.g. python-json-logger) on Serverless Dataproc. Is there a way to run an initialization action to install Python packages in Serverless Dataproc? Please let me know.
You have two options:
You can create a custom image with your dependencies (Python packages), push it to GCR (Google Container Registry), and pass its URI as a parameter in the command below:
For example:

    $ gcloud beta dataproc batches submit \
        --container-image=gcr.io/my-project-id/my-image:1.0.1 \
        --project=my-project-id --region=us-central1 \
        --jars=file:///usr/lib/spark/external/spark-avro.jar \
        --subnet=projects/my-project-id/regions/us-central1/subnetworks/my-subnet-name
See the guide on creating a custom container image for Dataproc Serverless for Spark.
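As a rough sketch, such an image only needs your extra packages baked in; Dataproc mounts the Spark binaries into the container at runtime. The base image, project ID, and tag below are placeholders, and the full container contract is in the guide above:

```dockerfile
# Hypothetical image: adjust base, project ID, and tag to your setup.
# Dataproc Serverless mounts Spark into the container at runtime, so
# only the extra Python dependencies need to be installed here.
FROM debian:11-slim

RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip3 install python-json-logger

# Dataproc Serverless expects a non-root 'spark' user with UID/GID 1099.
RUN groupadd -g 1099 spark && useradd -u 1099 -g 1099 -m spark
USER spark
```

Build and push it (e.g. `docker build -t gcr.io/my-project-id/my-image:1.0.1 .` followed by `docker push`), then reference the tag via `--container-image` as in the command above.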
Option two: add the script below to your Python file. It installs the desired package at runtime and then loads it from a path inside the container (Dataproc Serverless). The file must be saved in a bucket; it uses the google-cloud-secret-manager package as an example.
python-file.py
    import pip
    import sys
    import importlib
    from warnings import warn
    from dataclasses import dataclass

    def load_package(package, path):
        warn("Update path order. Watch out for importing errors.")
        if path not in sys.path:
            sys.path.insert(0, path)
        module = importlib.import_module(package)
        return importlib.reload(module)

    @dataclass
    class PackageInfo:
        import_path: str
        pip_id: str

    packages = [PackageInfo("google.cloud.secretmanager",
                            "google-cloud-secret-manager==2.4.0")]
    path = '/tmp/python_packages'

    # Install into a writable path inside the container, then import from it.
    pip.main(['install', '-t', path, *[package.pip_id for package in packages]])
    for package in packages:
        load_package(package.import_path, path=path)
    ...
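The load/reload half of this pattern can be exercised locally without pip: drop a module into a scratch directory (standing in for what `pip install -t` would produce) and load it the same way. The module name `demo_pkg` is made up for illustration:

```python
import importlib
import os
import sys
import tempfile

def load_package(package, path):
    # Put the target directory at the front of sys.path so it wins over
    # any preinstalled copy, then (re)import the module.
    if path not in sys.path:
        sys.path.insert(0, path)
    module = importlib.import_module(package)
    return importlib.reload(module)

# Simulate a package dropped into a scratch directory, the way
# `pip install -t /tmp/python_packages` would lay one down.
path = tempfile.mkdtemp()
with open(os.path.join(path, "demo_pkg.py"), "w") as f:
    f.write("VALUE = 42\n")

module = load_package("demo_pkg", path)
print(module.VALUE)  # -> 42
```

The `insert(0, path)` is what the warning in the original script is about: putting the directory first on `sys.path` shadows any copy of the same package already in the environment.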
Finally, the operator calls python-file.py:
    create_batch = DataprocCreateBatchOperator(
        task_id="batch_create",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://bucket-name/python-file.py",
                "args": ["value1", "value2"],
                # jar_file_uris is a repeated field, so pass a list.
                "jar_file_uris": ["gs://bucket-name/jar-file.jar"],
            },
            "environment_config": {
                "execution_config": {
                    "subnetwork_uri": "projects/my-project-id/regions/us-central1/subnetworks/my-subnet-name"
                },
            },
        },
        batch_id="batch-create",
    )