I have a Beam pipeline defined as:
PCollectionList.of(mycollection1).and(mycollection2)
    .apply(new MyTransform())
    .apply(BigQueryIO.write()
        .to("my_result_table")
        .withSchema()
        .withFormatFunction()
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withNumStorageWriteApiStreams(10)
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .withKmsKey(key)
        .withCreateDisposition(CREATE_IF_NEEDED)
        .withWriteDisposition(WRITE_TRUNCATE)
        .withCustomGcsTempLocation(ValueProvider.StaticValueProvider.of(tempLocation)));
This pipeline is run on Google Dataflow.
It works fine when MyTransform produces results. However, the business logic of MyTransform allows it to produce an empty PCollection. When that happens, I would still like an empty BigQuery table named my_result_table to be created.
It seems that Dataflow skips the BigQueryIO write entirely when MyTransform produces an empty PCollection, so no BigQuery table is created at all.
Is there any way to force BigQueryIO to create an empty table when MyTransform produces an empty PCollection?
You can apply Count.globally() to the output of MyTransform and use the resulting count as a side input alongside that output. If the side input is 0, use the Java BigQuery client library (not the Python one, since this is a Java pipeline) to create an empty BQ table and yield no output; otherwise, continue the pipeline into the BigQueryIO write as before.
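A minimal sketch of the client-library branch, assuming the `com.google.cloud:google-cloud-bigquery` dependency; the dataset ID and the schema fields (`id`, `value`) are placeholders you would replace with your own:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class EmptyTableCreator {

  // Placeholder schema; adapt the fields to match what BigQueryIO would write.
  static Schema expectedSchema() {
    return Schema.of(
        Field.of("id", StandardSQLTypeName.STRING),
        Field.of("value", StandardSQLTypeName.INT64));
  }

  // Creates an empty my_result_table if it does not already exist.
  public static void createEmptyTable(String projectId, String datasetId) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId tableId = TableId.of(projectId, datasetId, "my_result_table");
    if (bigquery.getTable(tableId) == null) {
      bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(expectedSchema())));
    }
  }
}
```

Note that WRITE_TRUNCATE in the original pipeline would overwrite this table on the next non-empty run, so creating it empty here is consistent with the existing write disposition.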
If you don't mind a dummy row in your table, you can instead emit a dummy row when the count side input is 0 and feed it into the same BigQueryIO write, with no client library needed.
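A sketch of the dummy-row variant, assuming `rows` is the `PCollection<TableRow>` produced by MyTransform and `p` is the Pipeline; `padIfEmpty` and the `id` field are made-up names for illustration. The key trick is a one-element "impulse" collection, because a DoFn applied directly to an empty PCollection would never fire:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionView;

public class EmptyPad {

  static PCollection<TableRow> padIfEmpty(Pipeline p, PCollection<TableRow> rows) {
    // Side input holding the element count of MyTransform's output.
    final PCollectionView<Long> countView =
        rows.apply(Count.globally()).apply(View.asSingleton());

    // A one-element impulse collection guarantees the DoFn runs exactly
    // once, even when `rows` itself is empty.
    PCollection<TableRow> dummy = p
        .apply("Impulse", Create.of("trigger"))
        .apply("EmitDummyIfEmpty", ParDo.of(new DoFn<String, TableRow>() {
              @ProcessElement
              public void process(ProcessContext c) {
                if (c.sideInput(countView) == 0L) {
                  // Placeholder row; adapt the fields to your table schema.
                  c.output(new TableRow().set("id", "placeholder"));
                }
              }
            }).withSideInputs(countView));

    // The flattened collection has at least one row whenever `rows` is empty,
    // so BigQueryIO with CREATE_IF_NEEDED / WRITE_TRUNCATE still creates the table.
    return PCollectionList.of(rows).and(dummy).apply(Flatten.pCollections());
  }
}
```

You would then pass the result of `padIfEmpty` into the existing `BigQueryIO.write()` chain unchanged.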