
BigQuery external table operator uses the wrong schema path

Here is a snippet from a DAG that I am working on:

from airflow.contrib.operators import bigquery_operator

create_ext_table = bigquery_operator.BigQueryCreateExternalTableOperator(
    task_id='create_ext_table',
    bucket='bucket-a',
    source_objects=['path/*'],  # the operator expects a list of objects
    schema_object='bucket-b/data/schema.json',
    destination_project_dataset_table='sandbox.write_to_BQ',
    source_format='CSV',
    field_delimiter=';')

create_ext_table

When I run the code, I get the following error on Composer 1.10.10+composer:

404 GET https://storage.googleapis.com/download/storage/v1/b/bucket-a/o/bucket-b%2Fdata%2Fschema.json?alt=media: (u'Request failed with status code', 404, u'Expected one of', 200, 206)

As seen in the error, Airflow concatenates the bucket param with the schema_object param... Is there any workaround for this? I cannot store the table schema and the table files in the same bucket.

Thanks

This is expected. As you can see in the source code for the operator, the bucket argument is used to fetch the schema_object, so the operator assumes both live in the same bucket.
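To make the failure mode concrete, here is a minimal sketch (plain Python, no Airflow needed, with a hypothetical helper name) of how the download path effectively gets built: the operator's own bucket is always prefixed onto schema_object, which is exactly what the 404 above shows.

```python
def schema_download_url(bucket, schema_object):
    """Roughly how the operator resolves the schema location:
    schema_object is always prefixed with the operator's own bucket."""
    return "gs://{}/{}".format(bucket, schema_object)


# Passing a path that already contains a bucket name produces a
# nonsensical object key inside bucket-a:
print(schema_download_url("bucket-a", "bucket-b/data/schema.json"))
# gs://bucket-a/bucket-b/data/schema.json -> the 404 in the question
```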

Since, as you mentioned, you cannot store them together, there are a few workarounds you can try. I'll describe them at a high level:

  1. You can extend the operator and override the execute method so that it retrieves the schema from the bucket you care about.
  2. You can add an upstream task that moves the schema object to bucket-a using GoogleCloudStorageToGoogleCloudStorageOperator. This requires handling schema_object differently from the way the source code handles it: namely, parsing the bucket name and object path out of it, then retrieving the object. Alternatively, you can create your own argument (something like schema_bucket) and use it in a similar manner.
     1. You can also delete this object using GoogleCloudStorageDeleteOperator as a downstream task after creating the external table, so it does not have to be persisted in `bucket-a`.
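For workaround 1 (or a custom schema_object convention in workaround 2), the operator would need to split a value like 'bucket-b/data/schema.json' into its bucket and object path. A small sketch of such a helper (hypothetical name, plain Python):

```python
def split_gcs_path(path):
    """Split 'bucket/object/path' into (bucket, object_path).

    A helper like this would let an overridden execute method, or an
    upstream copy task, accept a schema_object that names its own bucket.
    """
    bucket, _, object_path = path.partition("/")
    if not bucket or not object_path:
        raise ValueError("expected '<bucket>/<object path>', got: %r" % (path,))
    return bucket, object_path


print(split_gcs_path("bucket-b/data/schema.json"))
# ('bucket-b', 'data/schema.json')
```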
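Putting workarounds 2 and 2.1 together, the DAG wiring could look roughly like this (an untested sketch, assuming Airflow 1.10 contrib import paths and the bucket/object names from the question):

```python
from airflow.contrib.operators import bigquery_operator
from airflow.contrib.operators.gcs_to_gcs import (
    GoogleCloudStorageToGoogleCloudStorageOperator,
)
from airflow.contrib.operators.gcs_delete_operator import (
    GoogleCloudStorageDeleteOperator,
)

# 1) Copy the schema into bucket-a so the create-table operator
#    can find it under its own bucket.
copy_schema = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_schema',
    source_bucket='bucket-b',
    source_object='data/schema.json',
    destination_bucket='bucket-a',
    destination_object='data/schema.json')

# 2) schema_object is now a plain path inside bucket-a.
create_ext_table = bigquery_operator.BigQueryCreateExternalTableOperator(
    task_id='create_ext_table',
    bucket='bucket-a',
    source_objects=['path/*'],
    schema_object='data/schema.json',
    destination_project_dataset_table='sandbox.write_to_BQ',
    source_format='CSV',
    field_delimiter=';')

# 3) Optionally remove the copied schema afterwards (workaround 2.1).
delete_schema = GoogleCloudStorageDeleteOperator(
    task_id='delete_schema',
    bucket_name='bucket-a',
    objects=['data/schema.json'])

copy_schema >> create_ext_table >> delete_schema
```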

A final note on the schema_object argument: it is meant to be the object path within the same bucket, so if you use the operator as already defined it should be schema_object='data/schema.json'.
