Apache Beam/Dataflow doesn't discard late data from Pub/Sub

I'm having a hard time understanding how Beam's windowing is supposed to discard late data.

Here is my pipeline:

pipeline
| ReadFromPubSub() 
| WindowInto(
    FixedWindows(60),
    trigger=AfterWatermark(
      early=AfterCount(1000),
      late=AfterProcessingTime(60),
    ),
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=10*60,
  )
| GroupByKey()
| SomeTransform()
| Map(print)

My understanding is that elements read from Pub/Sub are assigned the timestamp at which they were published to the subscription.

When I started my pipeline, the subscription also contained old data. I expected only data from the last 10 minutes (the allowed_lateness value passed to WindowInto) to be printed. Instead, all of the data in the subscription was printed.

Am I missing something, or do I misunderstand windowing?

I use Python and Beam 2.42.0.

In the above pipeline, there is no late data.

Data is late only relative to the watermark. To calculate the watermark, Beam uses the timestamps of the messages being read; in your case, those are the timestamps assigned to the messages when the ReadFromPubSub transform produces them.
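To make "late relative to the watermark" concrete, here is a rough sketch in plain Python (no Beam required). It is a simplified model, not Beam's actual implementation: it assumes an element is dropped only once the watermark has passed the end of the element's window plus the allowed lateness. The function names and the sample timestamps are made up for illustration.

```python
def window_end(event_ts: float, size: float) -> float:
    """End (exclusive) of the fixed window that contains event_ts."""
    return (event_ts // size + 1) * size


def is_dropped(event_ts: float, watermark: float,
               size: float = 60, allowed_lateness: float = 600) -> bool:
    """Simplified model: drop an element only if the watermark is already
    past the end of its window plus the allowed lateness."""
    return watermark > window_end(event_ts, size) + allowed_lateness


old_event = 1_000.0  # event timestamp in seconds (illustrative value)

# Watermark derived from the same timestamps as the data: nothing is dropped.
print(is_dropped(old_event, watermark=1_000.0))

# Watermark one hour ahead of the element: the window has long expired.
print(is_dropped(old_event, watermark=1_000.0 + 3600))
```

The point of the model is that the age of a message by wall-clock time never appears in the check. Only the distance between the element's window and the watermark matters.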

Pub/Sub messages carry attributes in addition to the payload, and Beam can read the timestamp of each message from one of these attributes. If you don't specify one, ReadFromPubSub uses the time at which the message was published to Pub/Sub as the timestamp, and the watermark is estimated from those same timestamps. So when the pipeline starts on a backlog, the watermark does not start at the current clock value; it is held back by the oldest messages still waiting to be read. Those old messages are therefore never behind the watermark, and none of them counts as late.

See the related Stack Overflow question for more details on how the timestamp attribute is used with Pub/Sub (for Python, check the timestamp_attribute parameter of ReadFromPubSub).
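A minimal sketch of how the timestamp_attribute parameter might be wired up in Python. The attribute name event_ts, the helper names, and the sample date are assumptions made for this example. The attribute value is formatted as milliseconds since the Unix epoch, which is one of the formats the ReadFromPubSub documentation lists. Beam is imported lazily so the formatting helper can be tried without Beam installed.

```python
from datetime import datetime, timezone

# Hypothetical attribute name; the publisher and the pipeline must agree on it.
TIMESTAMP_ATTRIBUTE = "event_ts"


def to_millis_attribute(event_time: datetime) -> str:
    """Format an event time as milliseconds since the Unix epoch, as a string."""
    return str(int(event_time.timestamp() * 1000))


def build_read(subscription: str):
    """Build a ReadFromPubSub transform that takes element timestamps
    from the message attribute instead of the default."""
    # Imported here so the helper above works without apache_beam[gcp].
    from apache_beam.io.gcp.pubsub import ReadFromPubSub
    return ReadFromPubSub(
        subscription=subscription,
        timestamp_attribute=TIMESTAMP_ATTRIBUTE,
    )


# Attributes a publisher would attach to each message (illustrative date).
attrs = {
    TIMESTAMP_ATTRIBUTE: to_millis_attribute(
        datetime(2022, 11, 1, 12, 0, tzinfo=timezone.utc)
    )
}
print(attrs)
```

The publisher side is only hinted at here: whatever client publishes the messages would need to set this attribute on every message for the pipeline to pick it up.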

If the messages you are reading have no timestamp attribute but carry the timestamp in a field of the payload, you can use that field to generate timestamped values: parse the payload, then use a DoFn that emits TimestampedValue (or the WithTimestamps transform).
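Something along these lines could work for the payload case. This is a sketch under assumptions: the payload is taken to be JSON with a hypothetical event_time field holding an ISO 8601 string with an explicit UTC offset, and the function names are invented for the example. The parsing helper is plain Python; the Beam step is imported lazily inside its own function.

```python
import json
from datetime import datetime


def extract_event_time(payload: bytes) -> float:
    """Return the event time (seconds since the Unix epoch) stored in the
    payload's hypothetical "event_time" field."""
    record = json.loads(payload.decode("utf-8"))
    return datetime.fromisoformat(record["event_time"]).timestamp()


def with_event_timestamps(messages):
    """Re-stamp a PCollection of raw payloads with their own event time
    (requires apache_beam)."""
    import apache_beam as beam
    return messages | "AssignEventTime" >> beam.Map(
        lambda payload: beam.window.TimestampedValue(
            payload, extract_event_time(payload)
        )
    )


sample = b'{"event_time": "1970-01-01T00:01:00+00:00", "value": 42}'
print(extract_event_time(sample))
```

In a pipeline, with_event_timestamps would sit between the read step and WindowInto, so that windows are assigned from the payload's event time.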
