
Dataflow template/pattern for enriching fixed BigQuery data with streaming Pub/Sub data

I have a BigQuery dimension table (which doesn't change much) and streaming JSON data from Pub/Sub. What I want to do is query this dimension table, enrich the incoming Pub/Sub data by joining against it, and then write the resulting stream of joined data to another BigQuery table.

As I am new to Dataflow/Beam and the concept is still not that clear to me (or at least I have difficulty starting to write the code), I have a number of questions:

  1. What is the best template or pattern I can use to do that? Should I do the PTransform for BigQuery first (followed by the PTransform for Pub/Sub), or the PTransform for Pub/Sub first?
  2. How can I do the join? Something like ParDo.of(...).withSideInputs(PCollectionView<Map<String, String>> map)?
  3. What is the best window setting for the Pub/Sub data? Is it correct that the window setting for the BigQuery PTransform is different from that of the Pub/Sub one?

You need to join two PCollections.

  1. A PCollection that contains the data from Pub/Sub. This can be created using the PubsubIO.Read PTransform.
  2. A PCollection that contains the data from BigQuery. If the data is static, the BigQueryIO.Read transform can be used. If the data can change, though, the BigQuery transforms currently available in Beam probably will not work. One option might be to use the PeriodicImpulse transform together with your own ParDo to create a periodically refreshed input. See here for an example (please note that the PeriodicImpulse transform was added recently). A sketch of building both inputs for the static case is shown after this list.
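
For the static-table case, a minimal sketch of building both inputs with the Beam Java SDK could look like the following. The project, subscription, table, and column names (my-project, my-subscription, my_dataset.dimension_table, id) are placeholders, not anything from the question:

```java
import com.google.api.services.bigquery.model.TableRow;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;

public class EnrichPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // (1) Unbounded PCollection of JSON strings from Pub/Sub.
    PCollection<String> events = p.apply("ReadPubsub",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"));

    // (2) Bounded PCollection from the static BigQuery dimension table,
    //     keyed on the join column and materialized as a map side input.
    //     Assumes the join column is a STRING field named "id".
    PCollectionView<Map<String, TableRow>> dimensionView = p
        .apply("ReadDimension",
            BigQueryIO.readTableRows().from("my-project:my_dataset.dimension_table"))
        .apply("KeyByJoinColumn",
            WithKeys.of((TableRow row) -> (String) row.get("id"))
                .withKeyType(TypeDescriptors.strings()))
        .apply("AsMap", View.asMap());

    // (3) and (4): join and write steps follow in the later sketches.
    p.run();
  }
}
```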

You can combine the data in a ParDo where PCollection (1) is the main input and PCollection (2) is a side input (similar to the example above).
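
Continuing the sketch above, the join step might look roughly like this. The JSON join key (id), the output columns, and the use of Gson for parsing are assumptions for illustration only:

```java
// Continues the previous sketch; additional imports:
//   org.apache.beam.sdk.transforms.DoFn, org.apache.beam.sdk.transforms.ParDo,
//   com.google.gson.JsonObject, com.google.gson.JsonParser (assumes Gson on the classpath).

// Join step: for each Pub/Sub message, look up the matching dimension row
// in the side-input map and emit an enriched TableRow.
PCollection<TableRow> enriched = events.apply("Enrich",
    ParDo.of(new DoFn<String, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Map<String, TableRow> dimensions = c.sideInput(dimensionView);

        // Parse the incoming JSON message; "id" is the assumed join key.
        JsonObject event = JsonParser.parseString(c.element()).getAsJsonObject();
        String key = event.get("id").getAsString();
        TableRow dimension = dimensions.get(key);

        // Placeholder enrichment: carry the raw payload plus one dimension field.
        TableRow out = new TableRow()
            .set("id", key)
            .set("payload", c.element())
            .set("dimension_name", dimension == null ? null : dimension.get("name"));
        c.output(out);
      }
    }).withSideInputs(dimensionView));
```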

Finally, you can stream the output to BigQuery using the BigQueryIO.Write transform.
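
The final write step might look like the following sketch; the output table name and schema are placeholders, and streaming inserts is just one of the available write methods:

```java
// Continues the previous sketch; additional imports:
//   java.util.Arrays, com.google.api.services.bigquery.model.TableFieldSchema,
//   com.google.api.services.bigquery.model.TableSchema.

// Placeholder schema matching the TableRow built in the join step.
TableSchema schema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("id").setType("STRING"),
    new TableFieldSchema().setName("payload").setType("STRING"),
    new TableFieldSchema().setName("dimension_name").setType("STRING")));

// Streaming write of the enriched rows to the output BigQuery table.
enriched.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.enriched_table")
        .withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));
```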
