Component Gateway activation on Dataproc does not work with Composer (Airflow) operator airflow.providers.google.cloud.operators.dataproc
I'm trying to execute the DAG below. It seems that the operator creating the Dataproc cluster does not enable the optional components needed for Jupyter notebook and Anaconda. I found this code: Component Gateway with DataprocOperator on Airflow, and tried it, but it didn't solve the problem for me, I think because the Composer (Airflow) version there is different. My version is composer-2.0.0-preview.5 with airflow-2.1.4.
The operator works perfectly when creating the cluster, but the cluster is created without the optional components that enable Jupyter notebook. Does anyone have any ideas to help me?
from airflow.contrib.sensors.gcs_sensor import GoogleCloudStoragePrefixSensor
from airflow import DAG
from datetime import datetime, timedelta
from airflow.contrib.operators.dataproc_operator import (
    DataprocClusterCreateOperator,
    DataprocClusterDeleteOperator,
    DataProcSparkOperator,
)
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

yesterday = datetime.combine(datetime.today() - timedelta(1),
                             datetime.min.time())

default_args = {
    'owner': 'teste3',
    'depends_on_past': False,
    'start_date': yesterday,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'teste-dag-3', catchup=False, default_args=default_args, schedule_interval=None)
# configure the optional components
class CustomDataprocClusterCreateOperator(DataprocClusterCreateOperator):

    def __init__(self, *args, **kwargs):
        super(CustomDataprocClusterCreateOperator, self).__init__(*args, **kwargs)

    def _build_cluster_data(self):
        cluster_data = super(CustomDataprocClusterCreateOperator, self)._build_cluster_data()
        cluster_data['config']['endpointConfig'] = {
            'enableHttpPortAccess': True
        }
        cluster_data['config']['softwareConfig']['optionalComponents'] = ['JUPYTER', 'ANACONDA']
        return cluster_data
create_cluster = CustomDataprocClusterCreateOperator(
    dag=dag,
    task_id='start_cluster_example',
    cluster_name='teste-ge-{{ ds }}',
    project_id="sandbox-coe",
    num_workers=2,
    num_masters=1,
    master_machine_type='n2-standard-8',
    worker_machine_type='n2-standard-8',
    worker_disk_size=500,
    master_disk_size=500,
    master_disk_type='pd-ssd',
    worker_disk_type='pd-ssd',
    image_version='1.5.56-ubuntu18',
    tags=['allow-dataproc-internal'],
    region="us-central1",
    zone='us-central1-f',  # Variable.get('gc_zone')
    storage_bucket="bucket-dataproc-ge",
    labels={'product': 'sample-label'},
    service_account_scopes=['https://www.googleapis.com/auth/cloud-platform'],
    # properties={"yarn:yarn.nodemanager.resource.memory-mb": 15360,
    #             "yarn:yarn.scheduler.maximum-allocation-mb": 15360},
    # subnetwork_uri="projects/project-id/regions/us-central1/subnetworks/dataproc-subnet",
    retries=1,
    retry_delay=timedelta(minutes=1),
)  # starts a Dataproc cluster
stop_cluster_example = DataprocClusterDeleteOperator(
    dag=dag,
    task_id='stop_cluster_example',
    cluster_name='teste-ge-{{ ds }}',
    project_id="sandbox-coe",
    region="us-central1",
)  # deletes the Dataproc cluster
create_cluster >> stop_cluster_example
Edit: After taking a deeper look, you don't need a custom operator any more. The updated operator DataprocCreateClusterOperator has enable_component_gateway and optional_components, so you can just set them directly:
from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator, DataprocCreateClusterOperator

CLUSTER_GENERATOR = ClusterGenerator(
    project_id=PROJECT_ID,
    region=REGION,
    ...,
    enable_component_gateway=True,
    optional_components=['JUPYTER', 'ANACONDA'],
).make()

DataprocCreateClusterOperator(
    ...,
    cluster_config=CLUSTER_GENERATOR,
)
You can check this example DAG for more details. You can view all possible parameters of ClusterGenerator in the source code.
Original Answer: The operator was re-written (see PR). I think the issue is with your _build_cluster_data function.
You probably should change your code to:
def _build_cluster_data(self):
    cluster_data = super(CustomDataprocClusterCreateOperator, self)._build_cluster_data()
    cluster_data['config']['endpoint_config'] = {
        'enableHttpPortAccess': True
    }
    cluster_data['config']['software_config']['optional_components'] = ['JUPYTER', 'ANACONDA']  # redundant, see note 2
    return cluster_data
A few notes:
1. CustomDataprocClusterCreateOperator is deprecated. You should use DataprocCreateClusterOperator from the Google provider.
2. You don't need to set cluster_data['config']['endpoint_config'] manually; you can set the value directly by passing enable_component_gateway and optional_components to the operator (see the source code).