
How to add de-duplication to a streaming pipeline [apache-beam]

I have a working streaming pipeline in Apache Beam (Python) that ingests data from Pub/Sub, performs enrichment in Dataflow, and writes it to BigQuery.

Within the streaming window, I would like to ensure that messages are not duplicated (Pub/Sub guarantees only at-least-once delivery).

So, I figured I'd just use the Distinct transform from Beam, but as soon as I use it my pipeline breaks: nothing proceeds any further, and any local prints stop appearing.

Here is my pipeline code:

    with beam.Pipeline(options=options) as p:
    message = (p | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=known_args.topic)
                                             .with_output_types(bytes))

        bq_data = (message | "Decode" >> beam.FlatMap(lambda x: [x.decode('utf-8')])
                           | "Deduplication" >> beam.Distinct()
                           | "JSONLoad" >> beam.ParDo(ReadAsJSON())
                           | "Windowing" >> beam.WindowInto(window.FixedWindows(10, 0))
                           | "KeepRelevantData" >> beam.ParDo(KeepRelevantData())
                           | "PreProcessing" >> beam.ParDo(PreProcessing())
                           | "SendLimitedKeys" >> beam.ParDo(SendLimitedKeys(), schema=schema)
                   )

        if not known_args.local:
            bq_data | "WriteToBigQuery" >> beam.io.WriteToBigQuery(table=known_args.bq_table, schema=schema)
        else:
            bq_data | "Display" >> beam.ParDo(Display())

As you can see in the "Deduplication" step, I'm calling beam.Distinct.

Questions:

  1. Where should de-duplication happen in the pipeline?

  2. Is this even a correct/sane approach?

  3. How else can I de-duplicate data in the streaming buffer?

  4. Is de-duplication even required, or am I just wasting my time?

Any solutions or suggestions would be highly appreciated. Thanks.

You may find this blog post on exactly-once processing helpful. First, Dataflow already performs deduplication based on the Pub/Sub record id. However, as the blog states: "In some cases however, this is not quite enough. The user's publishing process might retry publishes".

So, if the system publishing messages to Pub/Sub may publish the same message multiple times, you may wish to attach your own deterministic record id; Cloud Dataflow will then detect and drop these duplicates for you. This is the approach I would recommend, rather than trying to deduplicate in your own pipeline.
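For illustration, here is a minimal sketch of what that could look like on the publishing side, using the google-cloud-pubsub client. The project and topic names and the record_id attribute name are placeholders, not anything from your setup:

    import hashlib
    import json

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project/topic names -- substitute your own.
    topic_path = publisher.topic_path("my-project", "my-topic")

    def publish_event(event: dict) -> None:
        payload = json.dumps(event, sort_keys=True).encode("utf-8")
        # Derive the id from the message content, so a retried publish of
        # the same event carries the same id and can be deduplicated.
        record_id = hashlib.sha256(payload).hexdigest()
        publisher.publish(topic_path, payload, record_id=record_id).result()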

On the pipeline side, you then point Dataflow at that attribute using withIdAttribute on PubsubIO.Read.
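That is the Java API; since your pipeline is in Python, the equivalent is the id_label argument to ReadFromPubSub (as far as I know, id_label is only honored when running on Dataflow, not the direct runner). A minimal sketch against your existing read, assuming the publisher sets a record_id attribute as above:

    message = (p | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                       topic=known_args.topic,
                       # Attribute name is an assumption; it must match
                       # whatever the publisher sets on each message.
                       id_label="record_id").with_output_types(bytes))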

Some explanation as to why I believe Distinct causes the stuckness: Distinct tries to deduplicate data within a window. Since Distinct comes before WindowInto in your pipeline, it is deduplicating over the global window, so the pipeline must buffer and compare every element it has ever seen, and because this is an unbounded PCollection, it will try to buffer forever.

I believe this will work properly if you perform windowing first and have deterministic event timestamps (it doesn't look like you are using withTimestampAttribute). Then Distinct will apply only to the elements within each window (and identical elements with identical timestamps will land in the same window). You may want to see if this works for prototyping, but I recommend adding unique record ids if possible and letting Dataflow handle deduplication by record id for best performance.
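If you want to prototype that ordering, a rough sketch of the reordered stages, reusing options, known_args, and the 10-second window from your pipeline, might look like:

    import apache_beam as beam
    from apache_beam.transforms import window

    with beam.Pipeline(options=options) as p:
        deduped = (p
                   | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=known_args.topic)
                                               .with_output_types(bytes)
                   | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
                   # Window first, so Distinct only holds state per
                   # 10-second window instead of buffering the unbounded
                   # global window forever.
                   | "Windowing" >> beam.WindowInto(window.FixedWindows(10, 0))
                   | "Deduplication" >> beam.Distinct())
        # JSONLoad, KeepRelevantData, etc. would follow as before.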
