Component Gateway activation on dataproc does not work with composer(airflow) operator airflow.providers.google.cloud.operators.dataproc

I'm trying to execute the DAG below. The operator that creates the Dataproc cluster does not seem to enable the optional components needed for Jupyter Notebook and Anaconda. I found this code: Component Gateway with DataprocOperator on Airflow and tried it, but it didn't solve the problem for me, I think because the Composer (Airflow) version there is different. My version is composer-2.0.0-preview.5 with airflow-2.1.4.

The operator creates the cluster without problems, but it does not create it with the optional components that enable Jupyter Notebook. Does anyone have any ideas to help me?

from airflow.contrib.sensors.gcs_sensor import GoogleCloudStoragePrefixSensor
from airflow import DAG
from datetime import datetime, timedelta
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator,DataprocClusterDeleteOperator, DataProcSparkOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

yesterday = datetime.combine(datetime.today() - timedelta(1),
                             datetime.min.time())


default_args = {
    'owner': 'teste3',
    'depends_on_past': False,
    'start_date': yesterday,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),

}

dag = DAG(
    'teste-dag-3',catchup=False, default_args=default_args, schedule_interval=None)


# configure the optional components
class CustomDataprocClusterCreateOperator(DataprocClusterCreateOperator):

    def __init__(self, *args, **kwargs):
        super(CustomDataprocClusterCreateOperator, self).__init__(*args, **kwargs)

    def _build_cluster_data(self):
        cluster_data = super(CustomDataprocClusterCreateOperator, self)._build_cluster_data()
        cluster_data['config']['endpointConfig'] = {
            'enableHttpPortAccess': True
        }
        cluster_data['config']['softwareConfig']['optionalComponents'] = [ 'JUPYTER', 'ANACONDA' ]
        return cluster_data


create_cluster=CustomDataprocClusterCreateOperator(
        dag=dag,
        task_id='start_cluster_example',
        cluster_name='teste-ge-{{ ds }}',
        project_id= "sandbox-coe",
        num_workers=2,
        num_masters=1,
        master_machine_type='n2-standard-8',
        worker_machine_type='n2-standard-8',
        worker_disk_size=500,
        master_disk_size=500,
        master_disk_type='pd-ssd',
        worker_disk_type='pd-ssd',
        image_version='1.5.56-ubuntu18',
        tags=['allow-dataproc-internal'],
        region="us-central1",
        zone='us-central1-f',#Variable.get('gc_zone'),
        storage_bucket = "bucket-dataproc-ge",
        labels = {'product' : 'sample-label'},
        service_account_scopes = ['https://www.googleapis.com/auth/cloud-platform'],
        #properties={"yarn:yarn.nodemanager.resource.memory-mb" : 15360,"yarn:yarn.scheduler.maximum-allocation-mb" : 15360},
        #subnetwork_uri="projects/project-id/regions/us-central1/subnetworks/dataproc-subnet",
        retries= 1,
        retry_delay=timedelta(minutes=1)
    ) #starts a dataproc cluster


stop_cluster_example = DataprocClusterDeleteOperator(
    dag=dag,
    task_id='stop_cluster_example',
    cluster_name='teste-ge-{{ ds }}',
    project_id="sandbox-coe",
    region="us-central1",
    ) #stops a running dataproc cluster




create_cluster  >> stop_cluster_example

Edit: After taking a deeper look, you don't need a custom operator any more. The updated DataprocCreateClusterOperator, used together with ClusterGenerator, supports enable_component_gateway and optional_components, so you can just set them directly:

from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator, DataprocCreateClusterOperator

CLUSTER_GENERATOR = ClusterGenerator(
    project_id=PROJECT_ID,
    region=REGION,
    ...,
    enable_component_gateway=True,
    optional_components = [ 'JUPYTER', 'ANACONDA' ]
).make()

DataprocCreateClusterOperator(
    ...,
    cluster_config=CLUSTER_GENERATOR
)

You can check this example DAG for more details. You can view all possible parameters of ClusterGenerator in the source code.
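For reference, here is a fuller sketch of how the cluster-creation part of the question's DAG could look with the provider operators. The values are reused from the question; the ClusterGenerator parameter names (enable_component_gateway, optional_components, etc.) assume a recent apache-airflow-providers-google release, so treat this as a sketch rather than a drop-in replacement:

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
)

with DAG('teste-dag-3', start_date=days_ago(1), schedule_interval=None, catchup=False) as dag:

    # Build the cluster config once, enabling Component Gateway and the optional components
    cluster_config = ClusterGenerator(
        project_id="sandbox-coe",
        region="us-central1",
        num_masters=1,
        num_workers=2,
        master_machine_type='n2-standard-8',
        worker_machine_type='n2-standard-8',
        master_disk_size=500,
        worker_disk_size=500,
        image_version='1.5.56-ubuntu18',
        storage_bucket="bucket-dataproc-ge",
        enable_component_gateway=True,
        optional_components=['JUPYTER', 'ANACONDA'],
    ).make()

    create_cluster = DataprocCreateClusterOperator(
        task_id='start_cluster_example',
        project_id="sandbox-coe",
        region="us-central1",
        cluster_name='teste-ge-{{ ds }}',
        cluster_config=cluster_config,
    )

    stop_cluster_example = DataprocDeleteClusterOperator(
        task_id='stop_cluster_example',
        project_id="sandbox-coe",
        region="us-central1",
        cluster_name='teste-ge-{{ ds }}',
    )

    create_cluster >> stop_cluster_example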

Original answer: The operator was re-written (see PR). I think the issue is with your _build_cluster_data function.

You should probably change your code to:

def _build_cluster_data(self):
    cluster_data = super(CustomDataprocClusterCreateOperator, self)._build_cluster_data()
    cluster_data['config']['endpoint_config'] = {
        'enableHttpPortAccess': True
    }
    cluster_data['config']['software_config']['optional_components'] = [ 'JUPYTER', 'ANACONDA' ] # redundant, see note 2 below
    return cluster_data

A few notes:

  1. DataprocClusterCreateOperator (which your custom operator extends) is deprecated. You should use DataprocCreateClusterOperator from the Google provider.

  2. You don't need the cluster_data['config']['software_config']['optional_components'] line at all; you can set the value directly by passing optional_components to the operator (see the source code), as sketched below.
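Putting note 2 together with the override above: if you keep the subclass approach, the override only needs to add the endpoint config, and the optional components can be passed straight to the operator. A minimal sketch, assuming the operator you are subclassing accepts an optional_components argument:

class CustomDataprocClusterCreateOperator(DataprocClusterCreateOperator):
    """Only adds the Component Gateway flag; everything else is passed through."""

    def _build_cluster_data(self):
        cluster_data = super()._build_cluster_data()
        cluster_data['config']['endpoint_config'] = {
            'enableHttpPortAccess': True
        }
        return cluster_data

create_cluster = CustomDataprocClusterCreateOperator(
    dag=dag,
    task_id='start_cluster_example',
    cluster_name='teste-ge-{{ ds }}',
    project_id="sandbox-coe",
    region="us-central1",
    num_workers=2,
    optional_components=['JUPYTER', 'ANACONDA'],  # set directly, no override needed
)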

