
Apache Beam/Dataflow doesn't discard late data from Pub/Sub

I'm having a hard time understanding how Beam's windowing is supposed to work with respect to discarding late data.

Here is my pipeline:

pipeline
| ReadFromPubSub() 
| WindowInto(
    FixedWindows(60),  # 60-second fixed windows
    trigger=AfterWatermark(
      early=AfterCount(1000),
      late=AfterProcessingTime(60),
    ),
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=10*60,  # 10 minutes
  )
| GroupByKey()
| SomeTransform()
| Map(print)

My understanding is that data elements from Pub/Sub are assigned the timestamp at which they are published to a subscription.

When I started my pipeline to consume data from Pub/Sub, which also included old data, I expected that only data from the last 10 minutes (as set by the allowed_lateness parameter of WindowInto) would be printed out. But the result is that all of the data from the subscription is printed out.

Am I missing something, or do I misunderstand windowing?

I'm using Python and Beam 2.42.0.

In the above pipeline, there is no late data.

Data is only late when compared to the watermark. To calculate the watermark, Beam uses the timestamps of the messages being read. In your case, that is the timestamp generated for each message in the ReadFromPubSub transform.

Pub/Sub messages have attributes in addition to the payload. Beam can use one of these attributes to read and set the timestamp of each message. If you don't specify one, Beam uses the current clock value as the timestamp. So in your case you are effectively working in processing time (each timestamp is set to the current processing time), and therefore it is impossible to have late data: the watermark will always be very close to the current clock value.

See this question on SO for more details on how the timestamp attribute is used with Pub/Sub (for Python, check the timestamp_attribute parameter of ReadFromPubSub).
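For illustration, here is a minimal sketch of reading with an event-time attribute. The subscription path and the attribute name event_ts are placeholders; the attribute is assumed to carry either milliseconds since the epoch or an RFC 3339 string.

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    messages = (
        p
        | ReadFromPubSub(
            # Placeholder subscription path.
            subscription="projects/my-project/subscriptions/my-sub",
            # Beam uses this attribute's value as each element's event
            # timestamp and derives the watermark from it, so old messages
            # can actually be late relative to allowed_lateness.
            timestamp_attribute="event_ts",
        )
    )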

If there is no timestamp attribute in the messages you are reading, but they contain a field (in the payload) with the timestamp, you can use it to generate a timestamped value (parsing the payload and then using a DoFn that emits TimestampedValue, or the WithTimestamps transform).
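A rough sketch of the DoFn approach, assuming the payload is JSON and carries a hypothetical event_time field expressed in Unix seconds (adjust the parsing to your actual schema):

import json

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

class AssignEventTimestamps(beam.DoFn):
    """Parses the payload and re-emits each record at its event time."""
    def process(self, payload):
        record = json.loads(payload.decode("utf-8"))
        # "event_time" is a placeholder field, assumed to be Unix seconds.
        yield TimestampedValue(record, record["event_time"])

# In the pipeline, right after ReadFromPubSub:
#   | beam.ParDo(AssignEventTimestamps())
# Or equivalently with plain Maps:
#   | beam.Map(json.loads)
#   | beam.Map(lambda r: TimestampedValue(r, r["event_time"]))

Downstream WindowInto, triggers, and allowed_lateness then operate on these event timestamps rather than on processing time.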
