Apache Beam does not create the BigQuery table when there is no data to write

Question

I have a Beam pipeline defined as:

PCollectionList.of(mycollection1).and(mycollection2)
    .apply(new MyTransform())
    .apply(BigQueryIO.write()
        .to("my_result_table")
        .withSchema()          // arguments omitted for brevity
        .withFormatFunction()  // arguments omitted for brevity
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withNumStorageWriteApiStreams(10)
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .withKmsKey(key)
        .withCreateDisposition(CREATE_IF_NEEDED)
        .withWriteDisposition(WRITE_TRUNCATE)
        .withCustomGcsTempLocation(ValueProvider.StaticValueProvider.of(tempLocation)));

This pipeline runs on Google Dataflow.

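It works fine when MyTransform produces some results. However, the business logic of MyTransform allows it to produce a PCollection with no elements. When that happens, I want an empty BigQuery table named my_result_table to exist.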

If MyTransform produces an empty PCollection, Dataflow seems to skip BigQueryIO entirely, so no BigQuery table is created.

Is there a way to force BigQuery to create an empty table when MyTransform produces an empty PCollection?

Answer 1

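You can apply Count.globally() to the output of MyTransform to get its size, and use that count as a side input when processing the output. If the side input is 0, create an empty BQ table with the Python client library and emit an empty output; otherwise, continue the pipeline and write with BigQueryIO.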
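
A minimal Java sketch of this idea follows, since the pipeline in the question is written in Java. It is only an illustration under several assumptions: it uses the google-cloud-bigquery Java client instead of the Python one, it assumes a batch pipeline in the global window (where Count.globally() emits 0 for an empty input), and the class name, dataset name, and column names are hypothetical. Because the count is already a single-element PCollection, the sketch reads it directly in a DoFn rather than through a side input. It does not truncate a table that already exists.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class EmptyTableOnEmptyOutput {

  /** Describes the empty table. The two columns are placeholders for the real schema. */
  public static TableInfo emptyTableInfo(String dataset, String table) {
    Schema schema = Schema.of(
        Field.of("id", StandardSQLTypeName.STRING),
        Field.of("value", StandardSQLTypeName.INT64));
    return TableInfo.of(TableId.of(dataset, table), StandardTableDefinition.of(schema));
  }

  /** Receives the single global count and creates the table when the count is zero. */
  static class CreateTableIfEmptyFn extends DoFn<Long, Void> {
    private final String dataset;
    private final String table;

    CreateTableIfEmptyFn(String dataset, String table) {
      this.dataset = dataset;
      this.table = table;
    }

    @ProcessElement
    public void processElement(@Element Long count) {
      if (count != 0L) {
        return; // BigQueryIO handles the non-empty case.
      }
      // Uses the default project and credentials of the worker.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
      TableInfo info = emptyTableInfo(dataset, table);
      // getTable returns null when the table does not exist; the check keeps retries harmless.
      if (bigquery.getTable(info.getTableId()) == null) {
        bigquery.create(info);
      }
    }
  }

  /** Attaches the emptiness check to the output of MyTransform. */
  public static <T> void createTableWhenEmpty(
      PCollection<T> results, String dataset, String table) {
    results
        .apply("CountResults", Count.<T>globally())
        .apply("CreateEmptyTable", ParDo.of(new CreateTableIfEmptyFn(dataset, table)));
  }
}
```

A call such as createTableWhenEmpty(results, "my_dataset", "my_result_table") would sit next to the existing BigQueryIO write, with results being the output of MyTransform.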

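If you don't mind writing a dummy row to the table, you can instead generate a dummy row when the size side input is 0 and pass it to the same BigQueryIO, with no need for the client library.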
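
The dummy-row variant could look roughly like the sketch below. It assumes the elements handed to BigQueryIO are already TableRow objects (the question uses a format function, so the real element type may differ), and the class name and the contents of the placeholder row are made up for illustration. As a simplification, the count from Count.globally() is consumed directly instead of being passed as a side input, which relies on a batch pipeline in the global window.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PlaceholderRowFallback {

  /** Hypothetical dummy row; its fields must match the real table schema. */
  public static TableRow placeholderRow() {
    return new TableRow().set("id", "placeholder").set("value", 0L);
  }

  /** Emits one dummy row when the global count is zero, and nothing otherwise. */
  static class EmitPlaceholderIfEmptyFn extends DoFn<Long, TableRow> {
    @ProcessElement
    public void processElement(@Element Long count, OutputReceiver<TableRow> out) {
      if (count == 0L) {
        out.output(placeholderRow());
      }
    }
  }

  /** Returns the original rows plus a dummy row if, and only if, there were no rows. */
  public static PCollection<TableRow> withPlaceholderIfEmpty(PCollection<TableRow> rows) {
    PCollection<TableRow> placeholder = rows
        .apply("CountRows", Count.<TableRow>globally())
        .apply("EmitPlaceholder", ParDo.of(new EmitPlaceholderIfEmptyFn()))
        .setCoder(TableRowJsonCoder.of());

    // The merged collection is never empty, so BigQueryIO always has something to write.
    return PCollectionList.of(rows)
        .and(placeholder)
        .apply("MergeRows", Flatten.<TableRow>pCollections());
  }
}
```

The result of withPlaceholderIfEmpty(rows) would then be passed to the same BigQueryIO.write() step. The trade-off is that the table ends up with one dummy row rather than being truly empty.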

Solution 1
0 2022-06-07 18:16:23