
Dataflow template/pattern for enriching fixed BigQuery data with streaming Pub/Sub data

I have a BigQuery dimension table (which doesn't change much) and streaming JSON data from Pub/Sub. What I want to do is query this dimension table, enrich the incoming Pub/Sub data by joining against it, and then write the resulting stream of joined data to another BigQuery table.

As I am new to Dataflow/Beam and the concept is still not that clear to me (or at least I have difficulty starting to write the code), I have a number of questions:

  1. What is the best template or pattern I can use to do that? Should I do the PTransform for BigQuery first (followed by the PTransform for Pub/Sub), or the PTransform for Pub/Sub first?
  2. How can I do the join? Something like ParDo.of(...).withSideInputs(PCollectionView<Map<String, String>> map)?
  3. What is the best window setting for the Pub/Sub data? Is it correct that the window setting for the BigQuery PTransform is different from that of the Pub/Sub one?

You need to join two PCollections.

  1. A PCollection that contains the data from Pub/Sub. This can be created using the PubsubIO.Read PTransform.
  2. A PCollection that contains the data from BigQuery. If the data is static, the BigQueryIO.Read transform can be used. If the data can change, though, the BigQuery transforms currently available in Beam probably will not work. One option might be to use the PeriodicImpulse transform together with your own ParDo to create a periodically refreshed input. See here for an example (please note that the PeriodicImpulse transform was added recently). A sketch of building both inputs for the static case is shown after this list.
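
For the static-table case, a minimal sketch of building both inputs with the Beam Java SDK could look like the following. The project, subscription, table, and column names (my-project, my-subscription, my_dataset.dimension_table, id) are placeholders, not anything from the question:

```java
import com.google.api.services.bigquery.model.TableRow;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;

public class EnrichPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // (1) Unbounded PCollection of JSON strings from Pub/Sub.
    PCollection<String> events = p.apply("ReadPubsub",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"));

    // (2) Bounded PCollection from the static BigQuery dimension table,
    //     keyed on the join column and materialized as a map side input.
    //     Assumes the join column is a STRING field named "id".
    PCollectionView<Map<String, TableRow>> dimensionView = p
        .apply("ReadDimension",
            BigQueryIO.readTableRows().from("my-project:my_dataset.dimension_table"))
        .apply("KeyByJoinColumn",
            WithKeys.of((TableRow row) -> (String) row.get("id"))
                .withKeyType(TypeDescriptors.strings()))
        .apply("AsMap", View.asMap());

    // (3) and (4): join and write steps follow in the later sketches.
    p.run();
  }
}
```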

You can combine the data in a ParDo where PCollection (1) is the main input and PCollection (2) is a side input (similar to the example above).
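
Continuing the sketch above, the join step might look roughly like this. The JSON join key (id), the output columns, and the use of Gson for parsing are assumptions for illustration only:

```java
// Continues the previous sketch; additional imports:
//   org.apache.beam.sdk.transforms.DoFn, org.apache.beam.sdk.transforms.ParDo,
//   com.google.gson.JsonObject, com.google.gson.JsonParser (assumes Gson on the classpath).

// Join step: for each Pub/Sub message, look up the matching dimension row
// in the side-input map and emit an enriched TableRow.
PCollection<TableRow> enriched = events.apply("Enrich",
    ParDo.of(new DoFn<String, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Map<String, TableRow> dimensions = c.sideInput(dimensionView);

        // Parse the incoming JSON message; "id" is the assumed join key.
        JsonObject event = JsonParser.parseString(c.element()).getAsJsonObject();
        String key = event.get("id").getAsString();
        TableRow dimension = dimensions.get(key);

        // Placeholder enrichment: carry the raw payload plus one dimension field.
        TableRow out = new TableRow()
            .set("id", key)
            .set("payload", c.element())
            .set("dimension_name", dimension == null ? null : dimension.get("name"));
        c.output(out);
      }
    }).withSideInputs(dimensionView));
```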

Finally, you can stream the output to BigQuery using the BigQueryIO.Write transform.
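
The final write step might look like the following sketch; the output table name and schema are placeholders, and streaming inserts is just one of the available write methods:

```java
// Continues the previous sketch; additional imports:
//   java.util.Arrays, com.google.api.services.bigquery.model.TableFieldSchema,
//   com.google.api.services.bigquery.model.TableSchema.

// Placeholder schema matching the TableRow built in the join step.
TableSchema schema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("id").setType("STRING"),
    new TableFieldSchema().setName("payload").setType("STRING"),
    new TableFieldSchema().setName("dimension_name").setType("STRING")));

// Streaming write of the enriched rows to the output BigQuery table.
enriched.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.enriched_table")
        .withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));
```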
