
Apache Beam - adding a delay into a pipeline

I have a simple pipeline that reads from a Pub/Sub topic and writes to BigQuery. I would like to introduce a 5-minute delay between reading a message from the topic and writing it to BigQuery.

I thought I could do this with a trigger, as in the snippet below, but the message still goes straight through with no delay.

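import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// inputEvents is the PCollection<PubsubMessage> read from the Pub/Sub topic.
PCollection<PubsubMessage> windowed_inputEvents =
    inputEvents.apply(
        Window.<PubsubMessage>into(FixedWindows.of(Duration.standardMinutes(1)))
              // Intended to fire 5 minutes after the first element arrives in the pane.
              .triggering(
                  AfterProcessingTime
                      .pastFirstElementInPane()
                      .plusDelayOf(Duration.standardMinutes(5)))
              .withAllowedLateness(Duration.standardMinutes(1))
              .discardingFiredPanes());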

Is it possible to create such a delay using triggers?

Thanks

It looks like you are mixing up a couple of things. Your example uses a fixed window of 1 minute, which means that at the end of each window all the data elements that belong to that window are emitted.
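To make the fixed-window idea concrete, here is a minimal plain-Java sketch (not Beam code) of how a timestamp maps to a 1-minute window. The class and method names are hypothetical, and the sketch assumes windows are aligned to the epoch with no offset.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Apache Beam: models fixed-window assignment.
public class FixedWindowSketch {

    // Start of the fixed window containing the timestamp,
    // assuming windows are aligned to the epoch (offset zero).
    public static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMillis = size.toMillis();
        long ts = timestamp.toEpochMilli();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMillis));
    }

    // Exclusive end of that window: start plus the window size.
    public static Instant windowEnd(Instant timestamp, Duration size) {
        return windowStart(timestamp, size).plus(size);
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2024-01-01T12:00:42Z");
        Duration oneMinute = Duration.ofMinutes(1);
        System.out.println("window start: " + windowStart(event, oneMinute));
        System.out.println("window end:   " + windowEnd(event, oneMinute));
    }
}
```

An element stamped 12:00:42 therefore belongs to the window that starts at 12:00:00 and ends at 12:01:00.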

Triggers are essentially additional levers that let you emit data before a window closes. They cannot hold data back after the window has closed. For example, if the window runs from 12:00 to 12:01 and the first element arrives at 12:00, the element is emitted when the window closes at 12:01; it is not held back until 12:05.
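The 12:00 / 12:01 / 12:05 example can be expressed as a small plain-Java model. This is only a simplified sketch of the behaviour described in this answer, not Beam's actual implementation: the class and method names are hypothetical, and allowed lateness and the difference between event time and processing time are ignored.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical model, not part of Apache Beam.
public class PaneTimingSketch {

    // The pane is emitted at whichever comes first: the trigger time
    // (first element + delay) or the close of the window.
    public static LocalTime emitTime(LocalTime windowEnd, LocalTime firstElement, Duration triggerDelay) {
        LocalTime triggerTime = firstElement.plus(triggerDelay);
        return triggerTime.isBefore(windowEnd) ? triggerTime : windowEnd;
    }

    public static void main(String[] args) {
        LocalTime windowEnd = LocalTime.of(12, 1);
        LocalTime firstElement = LocalTime.of(12, 0);

        // A 5-minute delay would fire at 12:05, after the window has closed,
        // so the element leaves at the window close instead.
        System.out.println(emitTime(windowEnd, firstElement, Duration.ofMinutes(5)));

        // A 30-second delay fires before the window closes, so the trigger wins.
        System.out.println(emitTime(windowEnd, firstElement, Duration.ofSeconds(30)));
    }
}
```

Under this model the trigger can only move emission earlier than the window close, never later, which is why the 5-minute delay in the question has no visible effect on a 1-minute window.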

To meet your requirement, you can do a couple of things:

  1. Increase the size of the window so that it is longer than the retention period (the time you want to hold the data); the data elements are then emitted with a delay.
  2. If that is not possible, BigQueryIO has a FILE_LOADS method that writes data into BigQuery in batches, and it accepts a time duration through withTriggeringFrequency. More details can be found here: https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html#withTriggeringFrequency-org.joda.time.Duration-
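As a rough illustration of the second option, the sketch below wires Pub/Sub into BigQuery with FILE_LOADS and a 5-minute triggering frequency. The topic name, table name, and the single "payload" column are placeholders invented for this example, and the destination table is assumed to exist already. Note that this batches writes roughly every 5 minutes, so a given message is delayed by up to 5 minutes rather than by exactly 5 minutes.

```java
import com.google.api.services.bigquery.model.TableRow;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Duration;

public class DelayedBigQueryWrite {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    pipeline
        // Placeholder topic: replace with your own.
        .apply("ReadFromPubSub",
            PubsubIO.readMessages().fromTopic("projects/my-project/topics/my-topic"))
        // Assumption: the payload is UTF-8 text stored in one STRING column.
        .apply("ToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((PubsubMessage msg) ->
                    new TableRow()
                        .set("payload", new String(msg.getPayload(), StandardCharsets.UTF_8))))
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                // Placeholder table: assumed to exist, hence CREATE_NEVER.
                .to("my-project:my_dataset.my_table")
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                // Start a load job about every 5 minutes instead of streaming rows.
                .withTriggeringFrequency(Duration.standardMinutes(5))
                // Newer SDK versions expect a shard count alongside the frequency.
                .withNumFileShards(1)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }
}
```

Running it requires the Beam Google Cloud Platform IO module on the classpath and a runner that supports streaming; check the javadoc of the SDK version you use, since the options required next to withTriggeringFrequency have changed between releases.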
