
Google Dataflow - Apache Beam GroupByKey(): Duplicating/Slow

I'm facing a situation with beam.GroupByKey(). I've loaded a file that contains 42,854 lines.

Due to business rules, I need to execute a GroupByKey(); however, after it finished executing, I noticed that I had almost double the number of lines. As you can see below:

[screenshot: GroupByKey() element count, nearly double the input]

Step before the GroupByKey():

[screenshot: element count of the step before the GroupByKey()]

Why am I seeing this behavior?

I'm not doing anything special in my pipeline:

with beam.Pipeline(runner, options=opts) as p:

    # Load file
    elements = p | 'Load File' >> beam.Create(fileRNS.values)

    # Prepare values (bulk insert)
    Script_Values = elements | 'Prepare Bulk Insert' >> beam.ParDo(Prepare_Bulk_Insert())

    Grouped_Values = Script_Values | 'Grouping values' >> beam.GroupByKey()

    # Bulk insert into PostgreSQL
    Grouped_Values | 'Insert PostgreSQL' >> beam.ParDo(ExecuteInsert)
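To pinpoint where the extra elements appear, the total of each intermediate PCollection can be compared against the 42,854 input lines. A minimal debugging sketch, assuming the pipeline above (the log_count helper and its step labels are illustrative, not part of the original post):

import logging

import apache_beam as beam

def log_count(pcoll, label):
    # Count every element of the PCollection and log the total.
    # Note: after a GroupByKey, this counts groups (one per key), not values.
    return (pcoll
            | f'Count {label}' >> beam.combiners.Count.Globally()
            | f'Log {label}' >> beam.Map(lambda n: logging.info('%s: %d elements', label, n)))

# Usage inside the pipeline:
# log_count(elements, 'after Load File')
# log_count(Script_Values, 'after Prepare Bulk Insert')

If 'after Prepare Bulk Insert' already reports roughly double the input, the duplication happens inside Prepare_Bulk_Insert() rather than in the GroupByKey() itself.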

Update (2021-02-09):

When I debug, Prepare_Bulk_Insert() has the following content:

[screenshot: debugger view of the Prepare_Bulk_Insert() output]

As you can see, the number of elements is correct. I don't understand why GroupByKey() shows a higher number of input elements when I'm sending the correct amount.

The Grouped_Values | 'Insert PostgreSQL' >> beam.ParDo(funcaoMap) step has the following input:

[screenshot: input element count of the 'Insert PostgreSQL' step, double the original]

Double the amount. =(

Kind regards, Juliano Medeiros

Those screenshots indicate that the "Prepare Bulk Insert" DoFn is outputting more than one element per input element. Your first screenshot shows the input PCollection of the GBK (which is produced by that DoFn), and the second shows the input to the DoFn, so the difference must be produced by that DoFn.
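Since the body of Prepare_Bulk_Insert() isn't shown in the question, the following is only an illustration of the failure mode described above: a process() method that emits twice per input element doubles the GBK input count. The class names and key extraction here are hypothetical:

import apache_beam as beam

class PrepareBulkInsertDuplicating(beam.DoFn):
    # Emits twice per input element, so the downstream count doubles.
    def process(self, element):
        key = element[0]      # hypothetical key extraction
        yield (key, element)
        yield (key, element)  # accidental second emit

class PrepareBulkInsertFixed(beam.DoFn):
    # Emits exactly one (key, value) pair per input element.
    def process(self, element):
        key = element[0]      # hypothetical key extraction
        yield (key, element)

With a single yield per element, the GBK input count should match the 42,854 lines loaded from the file.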
