GCP - BigQuery to Kafka as a stream

Question

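I have a Dataflow application (Java) running in GCP that reads data from a BigQuery table and writes it to Kafka. However, the application runs in batch mode, and I would like to make it a streaming application that continuously reads data from the BigQuery table and writes it to a Kafka topic.

BigQuery table: a partitioned table with an insert_time column (the timestamp at which the record was inserted into the table) and a message column.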

    // Bounded read: the query runs once when the pipeline starts.
    PCollection<TableRow> tablesRows = BigQueryUtil.readFromTable(pipeline,
            "select message, processed from `myprojectid.mydatasetname.mytablename` " +
            "where processed = false " +
            "order by insert_time desc ")
        .apply("Windowing", Window.into(FixedWindows.of(Duration.standardMinutes(1))));

    // Convert each row to a Kafka record and write it to the topic.
    tablesRows
        .apply("Converting to writable message", ParDo.of(new ProcessRowDoFn()))
        .apply("Writing Messages", KafkaIO.<String, String>write()
            .withBootstrapServers(bootStrapURLs)
            .withTopic(options.getKafkaInputTopics())
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer(StringSerializer.class)
            .withProducerFactoryFn(new ProducerFactoryFn(sslConfig, projected)));

    pipeline.run();

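Note: I have tried the options below, without success so far.

Option 1. I tried options.streaming(true); the job runs as a streaming job, but it finishes after the first successful write.

Option 2. Applying a trigger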

Window.into(
                            FixedWindows.of(Duration.standardMinutes(5)))
                    .triggering(
                            AfterWatermark.pastEndOfWindow()
                                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                    .withAllowedLateness(Duration.standardDays(2))
                    .accumulatingFiredPanes();

Option 3. Forcing the PCollection to be unbounded

     WindowingStrategy<?, ?> windowingStrategy = tablesRows.setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED).getWindowingStrategy();
.apply("Converting to writable message", ParDo.of(new ProcessRowDoFn())).setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED)

Any solution is appreciated.

Answer 1

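Some of the advice in the Side Input Patterns section of the Beam Programming Guide may be helpful here, even though you are not using this as a side input. In particular, that article discusses using GenerateSequence to emit a value periodically and trigger a read from a bounded source.

This lets your one-time query become a repeated query that periodically emits new records. However, it is up to your query logic to determine which range of the table each query scans, and I expect it will be difficult to avoid emitting duplicate records. Hopefully your use case can tolerate that.

Emitting into the global window looks like this: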

    PCollectionView<Map<String, String>> map =
        p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))))
            .apply(Sum.longsGlobally().withoutDefaults())
            .apply(
                ParDo.of(
                    new DoFn<Long, Map<String, String>>() {

                      @ProcessElement
                      public void process(
                          @Element Long input,
                          @Timestamp Instant timestamp,
                          OutputReceiver<Map<String, String>> o) {
                        // Read from BigQuery here and for each row output a record:
                        o.output(PlaceholderExternalService.readTestData(timestamp));
                      }
                    }))
            .apply(
                Window.<Map<String, String>>into(new GlobalWindows())
                    .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                    .discardingFiredPanes())
            .apply(View.asSingleton());

This assumes that the query result is relatively small, since the read happens entirely within a DoFn invocation.
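
One possible way to narrow the table range that each periodic query scans, and so limit the duplicate records mentioned above, is to derive a half-open insert_time range from the timestamp of each tick. The following is only a sketch under assumptions: the class IncrementalQuery, its methods rangeEnd and buildQuery, and the one-minute period are hypothetical names and values, not part of the original pipeline. It uses java.time rather than the Joda types Beam uses, so that it runs on its own. Rows that reach the table with an insert_time older than the range already queried would be missed by this scheme.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper (not from the original post): builds a query that
// covers exactly one closed period of insert_time per tick.
class IncrementalQuery {
    static final String TABLE = "myprojectid.mydatasetname.mytablename";

    // Round the tick timestamp down to a multiple of the period.
    // This is the exclusive upper bound of the range to query.
    static Instant rangeEnd(Instant tick, Duration period) {
        long periodMs = period.toMillis();
        long endMs = Math.floorDiv(tick.toEpochMilli(), periodMs) * periodMs;
        return Instant.ofEpochMilli(endMs);
    }

    // Query the half-open range [end - period, end). Consecutive ticks
    // produce adjacent, non-overlapping ranges.
    static String buildQuery(Instant tick, Duration period) {
        Instant end = rangeEnd(tick, period);
        Instant start = end.minus(period);
        return "select message from `" + TABLE + "` "
            + "where insert_time >= TIMESTAMP_MILLIS(" + start.toEpochMilli() + ") "
            + "and insert_time < TIMESTAMP_MILLIS(" + end.toEpochMilli() + ")";
    }

    public static void main(String[] args) {
        Instant tick = Instant.parse("2023-01-11T14:18:20Z");
        System.out.println(buildQuery(tick, Duration.ofMinutes(1)));
    }
}
```

Inside the DoFn, the @Timestamp value of each tick could be converted to epoch milliseconds and passed to a helper like buildQuery. The period would need to match the rate given to GenerateSequence so that no range is skipped or queried twice.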

Solution 1
0 2023-01-11 14:18:20