
GCP | Composer Dataproc submit job | Auth credential not found

I am running a GCP Composer cluster on GKE. I am defining a DAG to submit a job to a Dataproc cluster. I have read the GCP docs, and they say that Composer's service account will be used by the workers to send the Dataproc API requests.
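For context, here is a minimal sketch of the kind of DAG in question; the project, region, cluster name, and GCS paths are placeholders, not the original values:

# Minimal sketch of a DAG that submits a PySpark job via DataprocSubmitJobOperator.
# Project, region, cluster, and GCS URIs below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PROJECT_ID = "my-project"       # placeholder
REGION = "us-central1"          # placeholder
CLUSTER_NAME = "my-cluster"     # placeholder

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/job.py"},  # placeholder
}

with DAG(
    dag_id="dataproc_spark_operators",
    start_date=datetime(2022, 8, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    pyspark_task = DataprocSubmitJobOperator(
        task_id="pyspark_task",
        job=PYSPARK_JOB,
        region=REGION,
        project_id=PROJECT_ID,
        # No key file is configured on the connection, so the hook falls back
        # to Application Default Credentials on the worker.
    )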

But DataprocSubmitJobOperator reports an error while getting the auth credentials. The stack trace is below, and the Composer environment info is attached. I need suggestions to fix this issue.

[2022-08-23, 16:03:25 UTC] {taskinstance.py:1448} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=harshit.bapna@dexterity.ai
AIRFLOW_CTX_DAG_ID=dataproc_spark_operators
AIRFLOW_CTX_TASK_ID=pyspark_task
AIRFLOW_CTX_EXECUTION_DATE=2022-08-23T16:03:16.986859+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-08-23T16:03:16.986859+00:00
[2022-08-23, 16:03:25 UTC] {dataproc.py:1847} INFO - Submitting job
[2022-08-23, 16:03:25 UTC] {credentials_provider.py:312} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2022-08-23, 16:03:25 UTC] {taskinstance.py:1776} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/operators/dataproc.py", line 1849, in execute
    job_object = self.hook.submit_job(
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
    return func(self, *args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/dataproc.py", line 869, in submit_job
    client = self.get_job_client(region=region)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/dataproc.py", line 258, in get_job_client
    credentials=self._get_credentials(), client_info=CLIENT_INFO, client_options=client_options
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 261, in _get_credentials
    credentials, _ = self._get_credentials_and_project_id()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 240, in _get_credentials_and_project_id
    credentials, project_id = get_credentials_and_project_id(
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 321, in get_credentials_and_project_id
    return _CredentialProvider(*args, **kwargs).get_credentials_and_project()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 229, in get_credentials_and_project
    credentials, project_id = self._get_credentials_using_adc()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 307, in _get_credentials_using_adc
    credentials, project_id = google.auth.default(scopes=self.scopes)
  File "/opt/python3.8/lib/python3.8/site-packages/google/auth/_default.py", line 459, in default
    credentials, project_id = checker()
  File "/opt/python3.8/lib/python3.8/site-packages/google/auth/_default.py", line 221, in _get_explicit_environ_credentials
    credentials, project_id = load_credentials_from_file(
  File "/opt/python3.8/lib/python3.8/site-packages/google/auth/_default.py", line 107, in load_credentials_from_file
    raise exceptions.DefaultCredentialsError(
google.auth.exceptions.DefaultCredentialsError: File celery was not found.
[2022-08-23, 16:03:25 UTC] {taskinstance.py:1279} INFO - Marking task as UP_FOR_RETRY. dag_id=dataproc_spark_operators, task_id=pyspark_task, execution_date=20220823T160316, start_date=20220823T160324, end_date=20220823T160325
[2022-08-23, 16:03:25 UTC] {standard_task_runner.py:93} ERROR - Failed to execute job 32837 for task pyspark_task (File celery was not found.; 356144)
[2022-08-23, 16:03:26 UTC] {local_task_job.py:154} INFO - Task exited with return code 1
[2022-08-23, 16:03:26 UTC] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check

GCP Composer Env (screenshot)

Based on the error File celery was not found, I think that Application Default Credentials (ADC) is trying to read a file named celery and cannot find it. So check whether you have set the environment variable GOOGLE_APPLICATION_CREDENTIALS, because if it is set, ADC will read the file it points to (a quick diagnostic sketch is shown after the quote below):

  • If the environment variable GOOGLE_APPLICATION_CREDENTIALS is set, ADC uses the service account key or configuration file that the variable points to.
  • If the environment variable GOOGLE_APPLICATION_CREDENTIALS isn't set, ADC uses the service account that is attached to the resource that is running your code.
    This service account might be a default service account provided by Compute Engine, Google Kubernetes Engine, App Engine, Cloud Run, or Cloud Functions. It might also be a user-managed service account that you created.

GCP doc
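As a quick check, you could run something like the sketch below as a throwaway PythonOperator task (or from a shell on the worker) to see what ADC resolves to; the function name here is illustrative only:

# Diagnostic sketch: show what Application Default Credentials resolve to on a worker.
import os

import google.auth


def check_adc():
    # If this prints something like "celery", GOOGLE_APPLICATION_CREDENTIALS was set
    # to a bad value and should be unset or pointed at a valid service account key file.
    print("GOOGLE_APPLICATION_CREDENTIALS =", os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"))
    credentials, project_id = google.auth.default()
    print("ADC resolved project:", project_id)

If the variable turns out to be set to celery (or any other non-existent path) in the Composer environment, removing it should let ADC fall back to the service account attached to the Composer workers.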
