
How to combine & log a PCollection correctly after a GlobalWindow?

I am running a streaming pipeline that is supposed to ingest a web-shop clickstream from PubSub into BigQuery. Beam with the Dataflow runner is supposed to help me aggregate the value of purchase events per user. I want the aggregation to trigger after every 10 incoming events.

This should be straightforward, but I am relatively new to Beam and seem to be misunderstanding some basic concepts about windowing.

Below is the aggregation part of my pipeline.

  • 1st, I filter for the right event types (this part works).

  • 2nd, I apply a global window with a repeated trigger.

  • 3rd, my composite transform extracts the right field and assigns a key.

  • 4th, my composite transform combines per key and sums.

With this setup I was expecting the grouped sums to be printed for every 10 items.

Actually, nothing is logged; it seems like my ExtractAndSumValue transform is not even triggered, since my print statement never shows up.

What am I missing here?

Pipeline Setup:

fixed_windowed_items = (json_message
                        | 'Filter for purchase' >> beam.Filter(is_purchase)
                        | 'Global Window' >> beam.WindowInto(
                            beam.window.GlobalWindows(),
                            trigger=trigger.Repeatedly(trigger.AfterCount(10)),
                            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
                        | 'ExtractAndSumValue' >> ExtractAndSumValue()
                        | 'Print PColl' >> beam.Map(print)
                        )

Composite Transform:

class ExtractAndSumValue(beam.PTransform):
  def __init__(self):
    beam.PTransform.__init__(self)

  def expand(self, pcoll):
    print('expanding!')
    return(
        pcoll
        | beam.Map(lambda elem: (elem['client_id'], elem['ecommerce']['purchase']['value']))
        | beam.CombinePerKey(sum))

If I apply beam.ParDo(print) right after the global window, I do see the full JSON messages in the logs, in batches of 10 as expected.
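For context, such a tap is just the built-in print wrapped in a ParDo. A minimal, self-contained sketch of the idea on a tiny batch pipeline (the sample element is made up):

import apache_beam as beam

# beam.ParDo accepts a plain callable, so print can be dropped in as a
# side-effect-only debug step (it emits nothing downstream).
with beam.Pipeline() as p:
    (p
     | beam.Create([{'client_id': 'a', 'ecommerce': {'purchase': {'value': 9.99}}}])
     | 'Debug print' >> beam.ParDo(print))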

I've struggled with something similar just today regarding GroupByKey and I believe it applies to CombinePerKey as well.

According to the documentation (which mentions this in the GroupByKey section, not where you might first look for it):

We then set a windowing function for that PCollection. The GroupByKey transform groups the elements of the PCollection by both key and window, based on the windowing function.

The trigger.AfterCount(..) is misleading here. It does not mean that something happens in your CombinePerKey after every 10 input elements. Instead, Beam groups/combines your PCollection according to both the key and the window. In your case, this means it waits until it has received 10 messages for each client_id before summing them up. Apparently, your input data does not contain enough messages belonging to any single client.

I've tested this with GroupByKey, where I published 9 messages belonging to the same key to PubSub. The print was never executed. However, once I published message number 10, it immediately printed out the expected result.
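For reference, a sketch of such a test publisher (the project and topic names are placeholders, and the event schema beyond client_id and ecommerce.purchase.value is assumed):

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholder project/topic -- replace with the ones your pipeline reads from.
topic_path = publisher.topic_path('my-project', 'clickstream-topic')

# Publish ten purchase events for the same client_id; with AfterCount(10),
# the combined sum should only appear once the tenth message for that key
# has been ingested.
for _ in range(10):
    event = {
        'event_name': 'purchase',  # assumed field used by is_purchase
        'client_id': 'client-123',
        'ecommerce': {'purchase': {'value': 9.99}},
    }
    future = publisher.publish(topic_path, json.dumps(event).encode('utf-8'))
    future.result()  # block until the message is actually published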

To make a long story short, you would need to add additional triggers, such as event-time or processing-time triggers. Alternatively, you could set trigger.AfterCount(1), but then your CombinePerKey would not make much sense anymore.
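For example, a sketch of the windowing step with a processing-time fallback added (the 60-second delay is just an assumed value; json_message, is_purchase and ExtractAndSumValue are the names from the question):

import apache_beam as beam
from apache_beam.transforms import trigger, window

# Fire a pane for a key whenever it has received 10 elements, OR 60 seconds
# of processing time have passed since the first element in the pane --
# whichever comes first -- so keys with few events are still emitted.
fixed_windowed_items = (json_message
                        | 'Filter for purchase' >> beam.Filter(is_purchase)
                        | 'Global Window' >> beam.WindowInto(
                            window.GlobalWindows(),
                            trigger=trigger.Repeatedly(
                                trigger.AfterAny(
                                    trigger.AfterCount(10),
                                    trigger.AfterProcessingTime(60))),
                            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
                        | 'ExtractAndSumValue' >> ExtractAndSumValue()
                        | 'Print PColl' >> beam.Map(print))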
