
Dataflow template/pattern for enriching streaming Pub/Sub data with a fixed BigQuery table

I have a BigQuery dimension table (which doesn't change much) and streaming JSON data from Pub/Sub. What I want to do is query this dimension table, enrich the incoming Pub/Sub data by joining it against the dimension table, and then write the stream of joined data to another BigQuery table.

As I am new to Dataflow/Beam and the concepts are still not clear to me (or at least I have difficulty getting started with the code), I have a number of questions:

  1. What is the best template or pattern I can use to do this? Should I apply the BigQuery PTransform first (followed by the Pub/Sub PTransform), or the Pub/Sub PTransform first?
  2. How can I do the join? Something like ParDo.of(...).withSideInputs(PCollectionView<Map<String, String>> map)?
  3. What is the best window setting for the Pub/Sub input? Is it correct that the window setting for the BigQuery PTransform is different from the one for the Pub/Sub PTransform?

You need to join two PCollections.

  1. A PCollection that contains data from Pub/Sub. This can be created using the PubsubIO.Read PTransform.
  2. A PCollection that contains data from BigQuery. If the data is static, the BigQueryIO.Read transform can be used. If the data can change, though, the current BigQuery transforms available in Beam probably will not work. One option might be to use the PeriodicImpulse transform together with your own ParDo to create a periodically refreshed input. See here for an example (please note that the PeriodicImpulse transform was added to Beam only recently). A minimal read sketch is shown right after this list.
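
To make the two reads concrete, here is a minimal sketch using the Beam Java SDK. Everything project-specific is an assumption: the subscription path, the table spec, and the dim_key/dim_value column names are placeholders, and the dimension table's key and value are assumed to be STRING columns.

```java
import java.util.Map;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;

// Configure the runner / streaming options as needed for your environment.
Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

// (1) Streaming JSON messages from Pub/Sub -- this becomes the main input.
PCollection<String> pubsubJson = pipeline.apply("ReadPubsub",
    PubsubIO.readStrings()
        .fromSubscription("projects/my-project/subscriptions/my-subscription"));

// (2) The (mostly static) dimension table from BigQuery, converted into a
// side-input map keyed by the join column. "dim_key" and "dim_value" are
// placeholder column names, assumed to hold strings.
PCollectionView<Map<String, String>> dimensionView = pipeline
    .apply("ReadDimension",
        BigQueryIO.readTableRows().from("my-project:my_dataset.dimension_table"))
    .apply("ToKV", MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
        .via((TableRow row) ->
            KV.of((String) row.get("dim_key"), (String) row.get("dim_value"))))
    .apply("AsMap", View.asMap());
```

Note that View.asMap expects the keys to be unique; if the dimension key can repeat, View.asMultimap is the alternative.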

You can combine the data in a ParDo where PCollection (1) is the main input and PCollection (2) is a side input (similar to the example above).
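
Continuing the sketch above, the enrichment ParDo could look roughly like this. Parsing the JSON with Gson and the field names used below are assumptions, not anything prescribed by Beam:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

// Main input: pubsubJson; side input: dimensionView (both from the previous snippet).
PCollection<TableRow> enriched = pubsubJson.apply("Enrich",
    ParDo.of(new DoFn<String, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Look up the dimension map in the side input.
        Map<String, String> dims = c.sideInput(dimensionView);
        // Parse the incoming JSON (Gson is just one option); "dim_key" is
        // assumed to be present in every message.
        JsonObject json = JsonParser.parseString(c.element()).getAsJsonObject();
        String key = json.get("dim_key").getAsString();
        // Emit a row joining the message with the dimension value.
        c.output(new TableRow()
            .set("dim_key", key)
            .set("dim_value", dims.getOrDefault(key, "UNKNOWN"))
            .set("payload", c.element()));
      }
    }).withSideInputs(dimensionView));
```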

Finally, you can stream the output to BigQuery using the BigQueryIO.Write transform.
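
And, continuing the same sketch, the streaming write might look like this; the output table spec is again a placeholder, and CREATE_NEVER assumes the table already exists with a matching schema:

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

// Stream the enriched rows into the output table. With an unbounded input,
// BigQueryIO typically uses streaming inserts by default.
enriched.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.enriched_table")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

pipeline.run();
```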
