
Trigger spark submit jobs from airflow on Dataproc Cluster without SSH

Currently, I am executing my spark-submit commands from Airflow by SSHing into the cluster with a BashOperator and a bash command, but our client does not allow us to SSH into the cluster. Is it possible to execute the spark-submit command from Airflow without SSHing into the cluster?

You can use DataprocSubmitJobOperator to submit jobs in Airflow. Just make sure to pass the correct parameters to the operator. Take note that the job parameter is a dictionary based on the Dataproc Job resource, so you can use this operator to submit different job types such as PySpark, Pig, Hive, etc.

The code below submits a PySpark job:

import datetime

from airflow import models
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
PROJECT_ID = "my-project"
CLUSTER_NAME = "airflow-cluster" # name of created dataproc cluster
PYSPARK_URI = "gs://dataproc-examples/pyspark/hello-world/hello-world.py" # public sample script
REGION = "us-central1"

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": PYSPARK_URI},
}

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with models.DAG(
        'submit_dataproc_spark',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    submit_dataproc_job = DataprocSubmitJobOperator(
            task_id="pyspark_task", job=PYSPARK_JOB, region=REGION, project_id=PROJECT_ID
            )

    submit_dataproc_job  # single task, so no dependencies to set
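
Since the original question is about spark-submit, the same operator can also run a jar-based Spark job: swap the pyspark_job block for a spark_job block. The sketch below is only an assumption-based example that uses the SparkPi example jar shipped on Dataproc images; replace main_class, jar_file_uris, and args with your own artifact, and place the operator inside the same with models.DAG(...) block as above.

# Minimal sketch of a jar-based Spark job (equivalent of spark-submit with a jar).
# Assumes the SparkPi example jar available on Dataproc cluster nodes; adjust to
# your own main class, jar locations, and arguments.
SPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "spark_job": {
        "main_class": "org.apache.spark.examples.SparkPi",
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
}

# Instantiate inside the DAG context, just like the PySpark task above.
submit_spark_jar_job = DataprocSubmitJobOperator(
    task_id="spark_jar_task", job=SPARK_JOB, region=REGION, project_id=PROJECT_ID
)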

Airflow run: (screenshot)

Airflow logs: (screenshot)

Dataproc job: (screenshot)
