How to add de-duplication to a streaming pipeline [apache-beam]

I have a working streaming pipeline in Apache Beam [Python] that ingests data from Pub/Sub, performs enrichment in Dataflow, and passes it to BigQuery.

Within the streaming window, I would like to ensure that messages are not duplicated (since Pub/Sub guarantees only at-least-once delivery).

So, I figured I'd just use Beam's Distinct transform, but as soon as I use it my pipeline breaks (it can't proceed any further, and no local prints are visible either).

Here is my pipeline code:

    with beam.Pipeline(options=options) as p:
        message = (p | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=known_args.topic).
                   with_output_types(bytes))

        bq_data = (message | "Decode" >> beam.FlatMap(lambda x: [x.decode('utf-8')])
                           | "Deduplication" >> beam.Distinct()
                           | "JSONLoad" >> beam.ParDo(ReadAsJSON())
                           | "Windowing" >> beam.WindowInto(window.FixedWindows(10, 0))
                           | "KeepRelevantData" >> beam.ParDo(KeepRelevantData())
                           | "PreProcessing" >> beam.ParDo(PreProcessing())
                           | "SendLimitedKeys" >> beam.ParDo(SendLimitedKeys(), schema=schema)
                   )

        if not known_args.local:
            bq_data | "WriteToBigQuery" >> beam.io.WriteToBigQuery(table=known_args.bq_table, schema=schema)
        else:
            bq_data | "Display" >> beam.ParDo(Display())

As you can see from the "Deduplication" label, I'm calling the beam.Distinct transform.

Questions:

  1. Where should de-duplication happen in the pipeline?

  2. Is this even a correct/sane approach?

  3. How else can I de-duplicate streaming buffer data?

  4. Is de-duplication even required, or am I just wasting my time?

Any solutions or suggestions would be highly appreciated. Thanks.

You may find this blog on exactly-once processing helpful. First, Dataflow already performs deduplication based on the Pub/Sub record id. However, as the blog states: "In some cases however, this is not quite enough. The user's publishing process might retry publishes".

So, if the system publishing messages to Pub/Sub may publish the same message multiple times, you may wish to add your own deterministic record id; Cloud Dataflow will then detect the duplicates. This is the approach I would recommend, rather than trying to deduplicate in your own pipeline.

You can do this by using withIdAttribute on PubSubIO.Read; a Python-side sketch follows.
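Since your pipeline is in Python, here is a minimal sketch of what that could look like with the Python SDK, where the id_label parameter of ReadFromPubSub plays the role of withIdAttribute. The attribute name message_id_key, the hashing scheme, and the publish_with_id helper are assumptions for illustration, not part of your pipeline.

    import hashlib
    import json

    from google.cloud import pubsub_v1


    def publish_with_id(project, topic, payload):
        """Publish a payload whose dedup id is a hash of its content,
        so a retried publish of the same payload carries the same id.
        (message_id_key is an arbitrary attribute name chosen here.)"""
        publisher = pubsub_v1.PublisherClient()
        topic_path = publisher.topic_path(project, topic)
        data = json.dumps(payload, sort_keys=True).encode("utf-8")
        dedup_id = hashlib.sha256(data).hexdigest()
        # Pub/Sub attributes are passed as keyword arguments; values must be strings.
        publisher.publish(topic_path, data=data, message_id_key=dedup_id)


    # In the pipeline, point id_label at the same attribute so Dataflow
    # drops messages whose id it has already seen (id_label is honoured
    # by the Dataflow runner, not the DirectRunner):
    #
    # message = (p | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
    #     topic=known_args.topic,
    #     id_label="message_id_key").with_output_types(bytes))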

Some explanation as to why I believe Distinct causes the stuckness: Distinct tries to deduplicate data within a window. You are applying it to the global window, so your pipeline must buffer and compare all elements, and since this is an unbounded PCollection, it will try to buffer forever.

I believe this will work properly if you perform the windowing first and have deterministic event timestamps (it doesn't look like you are using withTimestampAttribute). Then Distinct applies only to the elements within each window, and identical elements with identical timestamps end up in the same window. You may want to see whether this works for prototyping (a sketch follows below), but I recommend adding unique record ids if possible and letting Dataflow handle deduplication based on record id for best performance.
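If you do want to prototype the in-pipeline approach, a rough sketch of that ordering is below. It reuses options and known_args from your pipeline and assumes the publisher also sets an event-time attribute (called event_ts here purely for illustration) so that retried copies of a message land in the same window; this is not your exact pipeline, just the reordering being described.

    import apache_beam as beam
    from apache_beam.transforms import window

    with beam.Pipeline(options=options) as p:
        deduped = (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic=known_args.topic,
                timestamp_attribute="event_ts").with_output_types(bytes)
            | "Decode" >> beam.Map(lambda x: x.decode("utf-8"))
            # Window first, so Distinct only has to hold one window's
            # worth of elements instead of buffering the unbounded
            # stream forever.
            | "Windowing" >> beam.WindowInto(window.FixedWindows(10))
            # Distinct now compares elements only within each 10s window.
            | "Deduplication" >> beam.Distinct()
        )
        # ...continue with ReadAsJSON, KeepRelevantData, etc. as before.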
