
Airflow 2: Job Not Found when transferring data from BigQuery into Cloud Storage

I'm migrating from Cloud Composer 1 to Cloud Composer 2 (i.e. from Airflow 1.10.15 to Airflow 2.2.5) and I'm trying to load data from BigQuery into GCS using the BigQueryToGCSOperator:

from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator 

# ...

BigQueryToGCSOperator(
    task_id='my-task',
    source_project_dataset_table='my-project-name.dataset-name.table-name',
    destination_cloud_storage_uris=['gs://my-bucket/another-path/*.jsonl'],
    export_format='NEWLINE_DELIMITED_JSON',
    compression=None,
    location='europe-west2'
)

The task fails with the following error:

[2022-06-07, 11:17:01 UTC] {taskinstance.py:1776} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py", line 141, in execute
    job = hook.get_job(job_id=job_id).to_api_repr()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
    return func(self, *args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1492, in get_job
    job = client.get_job(job_id=job_id, project=project_id, location=location)
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 2066, in get_job
    resource = self._call_api(
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 782, in _call_api
    return call()
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
    return retry_target(
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)

google.api_core.exceptions.NotFound: 404 GET https://bigquery.googleapis.com/bigquery/v2/projects/my-project-name/jobs/airflow_1654592634552749_1896245556bd824c71f31c79d28cdfbe?projection=full&prettyPrint=false: Not found: Job my-project-name:airflow_1654592634552749_1896245556bd824c71f31c79d28cdfbe

Any clue what the problem might be here, and why this doesn't work on Airflow 2.2.5 (even though the equivalent BigQueryToCloudStorageOperator works in Airflow 1.10.15 on Cloud Composer v1)?

Apparently, this seems to be a bug introduced in version 7.0.0 of apache-airflow-providers-google.

Note also that the file transfer from BQ to GCS actually succeeds (even though the task then fails).


As a workaround, you can either revert to a working version of the provider (if that's an option for you), e.g. 6.8.0, or use the BigQuery API directly and get rid of BigQueryToGCSOperator.
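If downgrading is the route you take, one way to express the pin is below; this is a sketch assuming 6.8.0 is the last version without the regression, as suggested above. Note that on Cloud Composer you would normally set this through the environment's PyPI packages configuration rather than running pip yourself:

```shell
# Pin the Google provider to a pre-7.0.0 release to avoid the
# "Job Not Found" failure in BigQueryToGCSOperator.
# (On Cloud Composer, add this constraint via the environment's
# "PyPI packages" settings instead of invoking pip directly.)
pip install "apache-airflow-providers-google==6.8.0"
```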

For example:

from google.cloud import bigquery
from airflow.operators.python import PythonOperator


def load_bq_to_gcs():
    client = bigquery.Client()
    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON

    # Replace <gcs-bucket-destination> with your own bucket/path; the
    # wildcard lets BigQuery shard the export across multiple files.
    destination_uri = "gs://<gcs-bucket-destination>/*.jsonl"

    # bq_project_name, bq_dataset_name and bq_table_name are placeholders
    # for your own project, dataset and table names.
    dataset_ref = bigquery.DatasetReference(bq_project_name, bq_dataset_name)
    table_ref = dataset_ref.table(bq_table_name)

    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        job_config=job_config,
        location='europe-west2',
    )
    extract_job.result()  # block until the export job completes

and then create an instance of PythonOperator that calls it:

PythonOperator(
    task_id='test_task',
    python_callable=load_bq_to_gcs,
)
