Apache Beam/Dataflow doesn't discard late data from Pub/Sub
I'm having a hard time understanding how Beam's windowing is supposed to work with regard to discarding late data.
Here is my pipeline:
(
    pipeline
    | ReadFromPubSub()
    | WindowInto(
        FixedWindows(60),  # 60-second fixed windows
        trigger=AfterWatermark(
            early=AfterCount(1000),
            late=AfterProcessingTime(60),
        ),
        accumulation_mode=AccumulationMode.DISCARDING,
        allowed_lateness=10 * 60,  # 10 minutes, in seconds
    )
    | GroupByKey()
    | SomeTransform()
    | Map(print)
)
My understanding is that data elements from Pub/Sub are assigned the timestamp at which they were published to the subscription.
When I start my pipeline to consume data from Pub/Sub, which also includes old data, I expect only data from the last 10 minutes (as set by the allowed_lateness parameter of WindowInto) to be printed. But the result is that all of the data from the subscription is printed.
Am I missing something, or do I understand windowing wrong?
I use Python and Beam 2.42.0.
In the above pipeline, there is no late data.
Data is only late relative to the watermark. To calculate the watermark, Beam uses the timestamps of the messages being read. In your case, that is the timestamp each message is given when it is generated in the ReadFromPubSub transform.
Pub/Sub messages have attributes in addition to the payload. Beam can use one of these attributes to read and set the timestamp of each message. If you don't specify one, Beam uses the current clock value to set that timestamp. So in your case you are actually working in processing time (timestamp set to the current processing time), and therefore it is impossible to have late data. The watermark will always be very close to the current clock value.
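To make the watermark comparison concrete, here is a small plain-Python sketch of the rule. It does not use Beam and is only a simplified model of what a runner does. The function names and the hand-picked watermark value are assumptions for illustration; the 60-second window and 10-minute allowed lateness are taken from the pipeline in the question.

```python
WINDOW_SIZE = 60            # seconds, as in FixedWindows(60)
ALLOWED_LATENESS = 10 * 60  # seconds, as in allowed_lateness=10*60


def window_end(timestamp, size=WINDOW_SIZE):
    """End of the fixed window that contains `timestamp` (seconds)."""
    return timestamp - (timestamp % size) + size


def is_dropped(timestamp, watermark, allowed_lateness=ALLOWED_LATENESS):
    """Simplified model: an element is dropped only when the watermark has
    already passed the end of its window plus the allowed lateness."""
    return watermark > window_end(timestamp) + allowed_lateness


# Hypothetical watermark, in seconds, chosen by hand for this example.
watermark = 5000

# Element timestamp close to the watermark: its window is still open.
print(is_dropped(timestamp=5000, watermark=watermark))

# Element timestamp far behind the watermark: its window has expired.
print(is_dropped(timestamp=1000, watermark=watermark))
```

In this model, the age of the data itself never enters the decision; only the element timestamp and the watermark do. If the timestamps track the watermark closely, the first case is the only one that can occur.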
See this question on SO for more details on how the timestamp attribute is used with Pub/Sub (for Python, check the timestamp_attribute parameter of ReadFromPubSub).
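As a rough sketch of how the timestamp_attribute parameter might be used: the publisher writes the event time into a message attribute, and the pipeline tells ReadFromPubSub which attribute to read. The attribute name event_time, the helper names, and the subscription argument are all assumptions made up for this example. The attribute value is written as milliseconds since the Unix epoch, which is one of the formats the ReadFromPubSub documentation lists for this parameter.

```python
from datetime import datetime, timezone

# Hypothetical attribute name; it must be the same on the publisher side
# and in the pipeline.
TIMESTAMP_ATTRIBUTE = "event_time"


def event_time_attributes(event_time: datetime) -> dict:
    """Build Pub/Sub message attributes that carry the event time as
    milliseconds since the Unix epoch (stored as a string)."""
    millis = int(event_time.timestamp() * 1000)
    return {TIMESTAMP_ATTRIBUTE: str(millis)}


def read_with_event_time(pipeline, subscription: str):
    """Read from Pub/Sub, taking element timestamps from the attribute."""
    # Imported here so the helper above can be used without Beam installed.
    from apache_beam.io import ReadFromPubSub

    return pipeline | "ReadEvents" >> ReadFromPubSub(
        subscription=subscription,
        timestamp_attribute=TIMESTAMP_ATTRIBUTE,
    )


# Attributes a publisher could attach to a message for this event time.
attrs = event_time_attributes(datetime(2022, 1, 1, tzinfo=timezone.utc))
print(attrs)
```

The windowing step from the question would then follow read_with_event_time unchanged, but the element timestamps would come from the attribute rather than from the read.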
If there is no timestamp attribute in the messages you are reading, but they contain a field (in the payload) with the timestamp, you can use it to generate a timestamped value (parsing the payload and then using a DoFn that emits TimestampedValue, or the WithTimestamps transform).
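A minimal sketch of the payload approach in Python, assuming the payload is UTF-8 JSON with a numeric event_time field holding seconds since the Unix epoch. Both the field name and the payload format are assumptions, since the question does not show the messages. For brevity it uses beam.Map returning a TimestampedValue instead of a full DoFn class.

```python
import json


def payload_event_time(payload: bytes, field: str = "event_time") -> float:
    """Return the event time (seconds since the Unix epoch) stored in a
    JSON payload. The field name is a hypothetical default."""
    return float(json.loads(payload.decode("utf-8"))[field])


def with_payload_timestamps(messages):
    """Re-stamp each element with the event time found in its payload."""
    # Imported here so payload_event_time can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.transforms.window import TimestampedValue

    return messages | "AddEventTime" >> beam.Map(
        lambda payload: TimestampedValue(payload, payload_event_time(payload))
    )


# Parsing a sample payload on its own, outside any pipeline.
sample = b'{"event_time": 1640995200, "value": 1}'
print(payload_event_time(sample))
```

This step would sit between ReadFromPubSub and WindowInto, so that windows are assigned from the payload's event time.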