
How to count the number of elements in a PCollection in Apache Beam

number_items = lines | 'window' >> beam.WindowInto(window.GlobalWindows()) \
    | 'CountGlobally' >> beam.combiners.Count.Globally() \
    | 'print' >> beam.ParDo(PrintFn())

I tried to display the result with print and logging calls, but nothing appeared.

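import logging
import apache_beam as beam

class PrintFn(beam.DoFn):
    def process(self, element):
        # Print and log each element, then pass it downstream unchanged.
        print(element)
        logging.error(element)
        return [element]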

I find it strange to want to count the elements of an unbounded collection. My first guess is that the pipeline never gets past the global window, because Beam waits for the end of the unbounded collection... unless you set a trigger.

Digging into the documentation, I found this:

Set a non-default trigger. This allows the global window to emit results under other conditions, since the default windowing behavior (waiting for all data to arrive) will never occur

So I was right about the trigger: the end never occurs on its own, because the collection is unbounded.
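To make the effect of a trigger concrete, here is a minimal plain-Python sketch (not Beam code) of what a repeated, count-based trigger does on a single global window. The names `running_counts`, `fire_every` and `accumulating` are hypothetical and chosen only for illustration; in Beam the corresponding pieces would be a trigger and an accumulation mode passed to `beam.WindowInto`.

```python
import itertools

def running_counts(source, fire_every, accumulating=True):
    """Yield a count each time `fire_every` new elements have arrived.

    With accumulating=True each emitted pane holds the total seen so far;
    with accumulating=False it holds only the elements since the last firing.
    """
    total = 0
    since_last = 0
    for _ in source:
        total += 1
        since_last += 1
        if since_last == fire_every:
            yield total if accumulating else since_last
            since_last = 0

# itertools.count() never ends, like an unbounded PCollection, so we can only
# ever look at a finite prefix of the emitted panes.
unbounded = itertools.count()
first_panes = list(itertools.islice(running_counts(unbounded, fire_every=3), 4))
print(first_panes)  # [3, 6, 9, 12]
```

If the count were only produced after the loop finished, nothing would ever be emitted for an endless source, which is the situation the question runs into with the default trigger.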

Did you try to skip the window and directly count globally?

For batch, you can simply do:

def print_row(element):
    print(element)

count_pcol = (
    lines
    | 'Count elements' >> beam.combiners.Count.Globally()
    | 'Print result' >> beam.Map(print_row)
)

beam.combiners.Count.Globally() is a PTransform that uses global combine to count all the elements of a PCollection and produce a single value.


For streaming, counting all the elements this way is not possible because the source is an unbounded PCollection, i.e. it never ends. CombineGlobally in your case will keep waiting for input and never produce an output.

A possible solution could be to set a window function and a non-default trigger.

I have written a simple pipeline that divides elements into fixed windows of 20 seconds and counts per key for each window. You can change the window and trigger based on your requirements.

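import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime

def form_pair(data):
    # Give every element the same key so that Count.PerKey counts them together.
    return 1, data

def print_row(element):
    print(element)

# p is the beam.Pipeline and input_subscription the Pub/Sub subscription path,
# both defined elsewhere.
count_pcol = (
    p
    | 'Read from pub sub' >> beam.io.ReadFromPubSub(subscription=input_subscription)
    | 'Form key value pair' >> beam.Map(form_pair)
    | 'Apply windowing and triggers' >> beam.WindowInto(
        window.FixedWindows(20),
        trigger=AfterProcessingTime(5),
        accumulation_mode=AccumulationMode.DISCARDING)
    | 'Count elements by key' >> beam.combiners.Count.PerKey()
    | 'Print result' >> beam.Map(print_row)
)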
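As a rough illustration of what the fixed windows do, the plain-Python sketch below (not Beam code) assigns timestamped elements to 20-second windows and counts each window separately. The function name `count_per_fixed_window` and the sample timestamps are made up for this example; Beam itself assigns windows from each element's event timestamp.

```python
from collections import Counter

def count_per_fixed_window(timestamped, size=20):
    """Count elements per fixed window.

    `timestamped` is an iterable of (timestamp_seconds, value) pairs. Each
    element falls in the window [start, start + size), where start is the
    timestamp rounded down to a multiple of `size`.
    """
    counts = Counter()
    for ts, _value in timestamped:
        start = int(ts // size) * size
        counts[(start, start + size)] += 1
    return dict(counts)

events = [(1, "a"), (7, "b"), (19.5, "c"), (20, "d"), (45, "e")]
print(count_per_fixed_window(events))
# {(0, 20): 3, (20, 40): 1, (40, 60): 1}
```

Each window covers a finite span of time, so its count can be emitted even though the stream as a whole never ends.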


 