GCP - BigQuery to Kafka as streaming
I have a Dataflow application (Java) running in GCP that reads data from a BigQuery table and writes it to Kafka. However, the application runs in batch mode; I would like to make it a streaming application that continuously reads from the BigQuery table and writes to the Kafka topic.

BigQuery table: a partitioned table with an insert_time column (the timestamp the record was inserted into the table) and a message column.
PCollection<TableRow> tablesRows = BigQueryUtil.readFromTable(pipeline,
        "select message, processed from `myprojectid.mydatasetname.mytablename` "
            + "where processed = false "
            + "order by insert_time desc")
    .apply("Windowing", Window.into(FixedWindows.of(Duration.standardMinutes(1))));

tablesRows
    .apply("Converting to writable message", ParDo.of(new ProcessRowDoFn()))
    .apply("Writing Messages", KafkaIO.<String, String>write()
        .withBootstrapServers(bootStrapURLs)
        .withTopic(options.getKafkaInputTopics())
        .withKeySerializer(StringSerializer.class)
        .withValueSerializer(StringSerializer.class)
        .withProducerFactoryFn(new ProducerFactoryFn(sslConfig, projected)));

pipeline.run();
Note: I have tried the options below, but no luck yet.

Option 1. I tried setting options.streaming(true); it runs as a stream, but it finishes after the first successful write.
Option 2. Applied a trigger:
Window.into(
FixedWindows.of(Duration.standardMinutes(5)))
.triggering(
AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardDays(2))
.accumulatingFiredPanes();
Option 3. Forcibly making the PCollection unbounded:
WindowingStrategy<?, ?> windowingStrategy = tablesRows.setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED).getWindowingStrategy();
.apply("Converting to writable message", ParDo.of(new ProcessRowDoFn())).setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED)
Any solution is appreciated.
Some of the advice in the Side Input Patterns section of the Beam Programming Guide may be helpful here, even though you aren't using this as a side input. In particular, that section discusses using GenerateSequence to periodically emit a value and trigger a read from a bounded source.

This could turn your one-time query into a repeated query that periodically emits new records. It will be up to your query logic to determine what range of the table to scan on each query, though, and I expect it will be difficult to avoid emitting duplicate records. Hopefully your use case can tolerate that.
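One way to bound what each periodic query scans is to track the newest insert_time already emitted and use it as a lower bound in the next query. Here is a minimal sketch of that idea; the IncrementalQueryBuilder class is hypothetical (not part of Beam or BigQuery), and the table and column names are the ones from the question:

```java
import java.time.Instant;

// Hypothetical helper: remembers the newest insert_time already read and
// builds the next incremental query, so each periodic tick scans only rows
// inserted after that bound.
class IncrementalQueryBuilder {
    private Instant lastSeen;

    IncrementalQueryBuilder(Instant start) {
        this.lastSeen = start;
    }

    // Build a query bounded below by the last insert_time already emitted.
    String nextQuery() {
        return "select message, insert_time "
                + "from `myprojectid.mydatasetname.mytablename` "
                + "where processed = false and insert_time > '" + lastSeen + "' "
                + "order by insert_time asc";
    }

    // After emitting rows, advance the bound to the newest insert_time seen.
    void advanceTo(Instant newestSeen) {
        if (newestSeen.isAfter(lastSeen)) {
            lastSeen = newestSeen;
        }
    }

    Instant lastSeen() {
        return lastSeen;
    }
}
```

Note that this bound would itself need to survive worker restarts (e.g. via Beam state or an external store), otherwise rows are re-read from the initial timestamp.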
Emitting into the global window would look like:
PCollectionView<Map<String, String>> map =
    p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
        .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))))
        .apply(Sum.longsGlobally().withoutDefaults())
        .apply(
            ParDo.of(
                new DoFn<Long, Map<String, String>>() {
                  @ProcessElement
                  public void process(
                      @Element Long input,
                      @Timestamp Instant timestamp,
                      OutputReceiver<Map<String, String>> o) {
                    // Read from BigQuery here and for each row output a record:
                    o.output(PlaceholderExternalService.readTestData(timestamp));
                  }
                }))
        .apply(
            Window.<Map<String, String>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
        .apply(View.asSingleton());
This assumes that the size of the query result is relatively small, since the read happens entirely within a DoFn invocation.
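Since duplicates are hard to avoid with a repeated query, a downstream step could drop rows whose key has already been seen before writing to Kafka. Here is a minimal in-memory sketch of that idea; the SeenKeyFilter class is hypothetical, and in a real pipeline this state would need to live in Beam stateful processing or an external store to survive restarts and work across workers:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory dedup filter: admits a key the first time it is
// offered and rejects any repeat. Purely illustrative; per-worker in-memory
// state is not durable or shared in a distributed pipeline.
class SeenKeyFilter {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a key is offered, false for repeats.
    boolean firstTime(String key) {
        return seen.add(key);
    }
}
```

A DoFn could call firstTime(...) with a stable row identifier (e.g. message plus insert_time) and only output rows that pass the filter.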