Apache Beam does not create BigQuery table if no data is to write

I have a Beam pipeline defined as:

PCollectionList.of(mycollection1).and(mycollection2)
    .apply(new MyTransform())
    .apply(BigQueryIO.write()
        .to("my_result_table")
        .withSchema(/* table schema omitted */)
        .withFormatFunction(/* format function omitted */)
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withNumStorageWriteApiStreams(10)
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .withKmsKey(key)
        .withCreateDisposition(CREATE_IF_NEEDED)
        .withWriteDisposition(WRITE_TRUNCATE)
        .withCustomGcsTempLocation(ValueProvider.StaticValueProvider.of(tempLocation)));

This pipeline is run on Google Dataflow.

It works fine when MyTransform produces results. However, the business logic of MyTransform can legitimately produce an empty PCollection. When that happens, I would still like an empty BigQuery table named my_result_table to be created.

It seems that Dataflow skips the BigQueryIO step entirely when MyTransform produces an empty PCollection, so no BigQuery table is created at all.

Is there any way to force an empty table to be created when MyTransform produces an empty PCollection?

You can apply Count.globally() to the output of MyTransform and make the resulting count available as a side input. If the count is 0, use the BigQuery client library to create the empty table and emit nothing; otherwise, let the pipeline continue into the BigQueryIO write as usual (see the sketch below).
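A minimal Java sketch of that first approach (the pipeline in the question is Java, so the Java BigQuery client library is assumed here; the project, dataset, schema field, and helper/step names are placeholders, not the asker's real values). A one-element seed collection guarantees the DoFn below runs even when the output is empty; the DoFn reads the count as a side input and creates the table only when it is zero:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

class EmptyTableWorkaround {
  /** Creates my_result_table via the client library when "results" has no elements. */
  static <T> void createEmptyTableIfNeeded(Pipeline pipeline, PCollection<T> results) {
    // Count the elements and expose the single count value as a side input.
    PCollectionView<Long> countView = results
        .apply("CountResults", Count.<T>globally())
        .apply("CountAsView", View.<Long>asSingleton());

    // A one-element seed collection, so the DoFn runs exactly once even when
    // "results" is empty.
    pipeline
        .apply("Seed", Create.of("seed"))
        .apply("CreateEmptyTable", ParDo.of(new DoFn<String, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            if (c.sideInput(countView) == 0L) {
              // No rows were produced: create the empty table with the client library.
              BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
              TableId tableId = TableId.of("my_project", "my_dataset", "my_result_table");
              Schema schema = Schema.of(Field.of("some_field", StandardSQLTypeName.STRING));
              if (bigquery.getTable(tableId) == null) {
                bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
              }
            }
          }
        }).withSideInputs(countView));
  }
}

The BigQueryIO.write() from the question stays in place: with a non-empty output it writes as before, and with an empty output it does nothing while the branch above creates the table. Note that in the empty case WRITE_TRUNCATE never runs, so this sketch does not clear a table left over from a previous run.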

If you don't mind having a dummy row in your table, you can instead emit a dummy row when the count side input is 0 and pass it to the same BigQueryIO, without using the client library at all (second sketch below).
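A sketch of that second variant, under the same assumptions (Java; placeholder field and step names; the write is abbreviated and should carry the same schema and disposition settings as in the question). The count side input decides whether a single dummy TableRow is emitted, and that branch is flattened with the real output before a single BigQueryIO write:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionView;

class DummyRowWorkaround {
  /** Writes "rows" to BigQuery, padding with one dummy row when "rows" is empty. */
  static void writeWithDummyRowIfEmpty(Pipeline pipeline, PCollection<TableRow> rows) {
    PCollectionView<Long> countView = rows
        .apply("CountRows", Count.<TableRow>globally())
        .apply("CountAsView", View.<Long>asSingleton());

    // One seed element, so the DoFn runs even when "rows" is empty.
    PCollection<TableRow> dummyIfEmpty = pipeline
        .apply("Seed", Create.of("seed"))
        .apply("EmitDummyRow", ParDo.of(new DoFn<String, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            if (c.sideInput(countView) == 0L) {
              c.output(new TableRow().set("some_field", "placeholder"));
            }
          }
        }).withSideInputs(countView))
        .setCoder(TableRowJsonCoder.of());

    // Flatten the real rows with the (usually empty) dummy branch and write once.
    PCollectionList.of(rows).and(dummyIfEmpty)
        .apply("FlattenForWrite", Flatten.<TableRow>pCollections())
        .apply(BigQueryIO.writeTableRows()
            .to("my_result_table")
            // ...plus the schema, dispositions, and other options from the question
            );
  }
}

Because the flattened collection always contains at least one row, the write step always runs, so CREATE_IF_NEEDED creates the table even when MyTransform produced nothing.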
