
Composer workflow fails at dataproc operator

I have a Composer environment set up in GCP, and it is running the following DAG:

from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator
from airflow.operators.bash_operator import BashOperator

with DAG('sample-dataproc-dag',
         default_args=DEFAULT_DAG_ARGS,
         schedule_interval=None) as dag:  # Here we are using dag as context

    # Submit the PySpark job.
    submit_pyspark = DataProcPySparkOperator(
        task_id='run_dataproc_pyspark',
        main='gs://.../dataprocjob.py',
        cluster_name='xyz',
        dataproc_pyspark_jars='gs://.../spark-bigquery-latest_2.12.jar')

    simple_bash = BashOperator(
        task_id='simple-bash',
        bash_command="ls -la")

    # The context manager already assigns the DAG to each task,
    # so submit_pyspark.dag = dag is not needed.
    submit_pyspark.set_upstream(simple_bash)

This is my dataprocjob.py:

from pyspark.sql import SparkSession



if __name__ == '__main__':
    spark = SparkSession.builder.appName('Jupyter BigQuery Storage').getOrCreate()
    table = "projct.dataset.txn_w_ah_demo"
    df = spark.read.format("bigquery").option("table", table).load()
    df.printSchema()

My Composer pipeline fails at the Dataproc step. In the Composer logs stored in GCS, this is what I see:

[2020-09-23 21:40:02,849] {taskinstance.py:1059} ERROR - <HttpError 403 when requesting https://dataproc.googleapis.com/v1beta2/projects/lt-dia-pop-dis-upr/regions/global/jobs?clusterName=dppoppr004&alt=json returned "Not authorized to requested resource.">@-@{"workflow": "sample-dataproc-dag", "task-id": "run_dataproc_pyspark", "execution-date": "2020-09-23T21:39:42.371933+00:00"}
Traceback (most recent call last):
  File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 930, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/airflow/airflow/contrib/operators/dataproc_operator.py", line 1139, in execute
    super(DataProcPySparkOperator, self).execute(context)
  File "/usr/local/lib/airflow/airflow/contrib/operators/dataproc_operator.py", line 707, in execute
    self.hook.submit(self.hook.project_id, self.job, self.region, self.job_error_states)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataproc_hook.py", line 311, in submit
    num_retries=self.num_retries)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataproc_hook.py", line 51, in __init__
    clusterName=cluster_name).execute()
  File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 851, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://dataproc.googleapis.com/v1beta2/projects/lt-dia-pop-dis-upr/regions/global/jobs?clusterName=dppoppr004&alt=json returned "Not authorized to requested resource.">

At first glance, the Google Cloud account you are using to call the Dataproc API does not have sufficient permissions for the Operator.

The problem you describe appears to correspond to the Dataproc permissions granted to your application.

According to the documentation, you need different role permissions to perform Dataproc tasks, for example:

dataproc.clusters.create permits the creation of Cloud Dataproc clusters in the containing project
dataproc.jobs.create permits the submission of Dataproc jobs to Dataproc clusters in the containing project
dataproc.clusters.list permits the listing of details of Dataproc clusters in the containing project

To submit a Dataproc job to an existing cluster, you need the "dataproc.clusters.use" and "dataproc.jobs.create" permissions.

To grant the correct permissions, you can follow the documentation on updating the service account used in your code and add the appropriate roles.
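As a sketch, granting a Dataproc role to the service account with gcloud could look like the following. The project ID and service-account name here are placeholders (substitute the ones your Composer environment actually uses), and roles/dataproc.editor is one role that bundles dataproc.clusters.use and dataproc.jobs.create; a narrower custom role would also work.

# Placeholder project and service-account names -- replace with your own.
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:composer-env-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/dataproc.editor"

After the binding is applied, re-running the DAG should allow the jobs.submit call in the traceback to succeed.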
