
Airflow 2: Job Not Found when transferring data from BigQuery into Cloud Storage

I am trying to migrate from Cloud Composer 1 to Cloud Composer 2 (i.e. from Airflow 1.10.15 to Airflow 2.2.5) and to load data from BigQuery into GCS using the BigQueryToGCSOperator:

from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator 

# ...

BigQueryToGCSOperator(
    task_id='my-task',
    source_project_dataset_table='my-project-name.dataset-name.table-name',
    destination_cloud_storage_uris=['gs://my-bucket/another-path/*.jsonl'],
    export_format='NEWLINE_DELIMITED_JSON',
    compression=None,
    location='europe-west2'
)

This results in the following error:

[2022-06-07, 11:17:01 UTC] {taskinstance.py:1776} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py", line 141, in execute
    job = hook.get_job(job_id=job_id).to_api_repr()
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
    return func(self, *args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1492, in get_job
    job = client.get_job(job_id=job_id, project=project_id, location=location)
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 2066, in get_job
    resource = self._call_api(
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 782, in _call_api
    return call()
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
    return retry_target(
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)

google.api_core.exceptions.NotFound: 404 GET https://bigquery.googleapis.com/bigquery/v2/projects/my-project-name/jobs/airflow_1654592634552749_1896245556bd824c71f31c79d28cdfbe?projection=full&prettyPrint=false: Not found: Job my-project-name:airflow_1654592634552749_1896245556bd824c71f31c79d28cdfbe

Any clue what the problem might be here, and why this does not work on Airflow 2.2.5 (even though the equivalent BigQueryToCloudStorageOperator works on Airflow 1.10.15 in Cloud Composer v1)?

Apparently, this seems to be a bug introduced in version v7.0.0 of the apache-airflow-providers-google package.

Also note that the file transfer from BQ into GCS actually succeeds (even though the task itself fails).


As a workaround, you can either revert to a working version (if possible), e.g. 6.8.0, or use the BigQuery API directly and do away with BigQueryToGCSOperator altogether.
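For the first option, pinning the provider package back to 6.8.0 might look like the following (a sketch only: the exact mechanism depends on how you manage dependencies; on Cloud Composer 2 you would typically set this through the environment's PyPI packages configuration rather than by running pip yourself):

```shell
pip install "apache-airflow-providers-google==6.8.0"
```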

For the second option, for example:

from google.cloud import bigquery
from airflow.operators.python import PythonOperator


def load_bq_to_gcs():
    # Placeholders: <gcs-bucket-destination>, bq_project_name, bq_dataset_name
    # and bq_table_name must be replaced with your own values.
    client = bigquery.Client()
    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON

    destination_uri = "<gcs-bucket-destination>*.jsonl"

    dataset_ref = bigquery.DatasetReference(bq_project_name, bq_dataset_name)
    table_ref = dataset_ref.table(bq_table_name)

    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        job_config=job_config,
        location='europe-west2',
    )
    extract_job.result()
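A note on the `*` in the destination URI above: BigQuery requires a single wildcard when an export may be split into multiple files (exports larger than 1 GB are sharded), and it replaces the `*` with a zero-padded shard number. A minimal helper sketching the pattern (the function name and parameters are illustrative, not part of any library):

```python
def sharded_export_uri(bucket: str, prefix: str, ext: str = "jsonl") -> str:
    """Build a GCS destination URI containing the single '*' wildcard that
    BigQuery replaces with a zero-padded shard number on export."""
    return f"gs://{bucket}/{prefix}/*.{ext}"
```

For instance, `sharded_export_uri("my-bucket", "another-path")` yields `"gs://my-bucket/another-path/*.jsonl"`, matching the URI used in the question.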

and then create an instance of PythonOperator:

PythonOperator(
    task_id='test_task',
    python_callable=load_bq_to_gcs,
)

