
How to combine data in a PCollection - Apache Beam

I want to combine data in a PCollection.

The input is a CSV file:

customer id,customer name,transaction amount,transaction type  
cust123,ravi,100,D  
cust123,ravi,200,D  
cust234,Srini,200,C  
cust444,shaker,500,D  
cust123,ravi,100,C  
cust123,ravi,300,C  

After reading the text file into a collection of objects, I want to combine the records so that the output looks like this:

cust123,ravi,300,D  
cust123,ravi,400,C  
cust234,Srini,200,C  
cust444,shaker,500,D

Pipeline pipeline = Pipeline.create(
    PipelineOptionsFactory.fromArgs(args).withValidation().create());

// Read the CSV lines and parse each one into a Customer object.
PCollection<Customer> pCollection =
    pipeline.apply("Read", TextIO.read().from("MyFile.csv"))
            .apply("splitData and store",
                ParDo.of(new TextTransform.SplitValues()));

If I understand correctly, you need to sum the transaction amounts, grouped by customer id and transaction type. In that case, at a high level, you need to:

  • assign keys to the records:
    • you can use the WithKeys PTransform for that (see its documentation);
    • the choice of key is up to you; for example, you can combine the customer id with the transaction type, something like: csvField[0] + "," + csvField[3]
  • group the records by the new key using the GroupByKey PTransform (see its documentation);
  • the output of the GroupByKey (GBK) is, for each key, a collection of the records that share that key, so you will need to apply a ParDo that accepts such a collection (all records belonging to the same customer and transaction type), sums up the amounts, and outputs one record with the sum.
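The keying and summing logic in these steps can be sketched in plain Java, without any Beam dependency, so it can be run on its own. This is only an illustration under stated assumptions: the class and method names (KeyedSumSketch, keyOf, parse, sumByCustomerAndType) are hypothetical, the header row is assumed to have been skipped, and amounts are assumed to be integers. In a real pipeline, the key function would be what you pass to WithKeys, and the summing loop would live in the ParDo that follows GroupByKey.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch: plain Java, no Beam dependency.
class KeyedSumSketch {

    // Step 1: derive the grouping key from customer id and transaction type.
    static String keyOf(String[] csvField) {
        return csvField[0] + "," + csvField[3];
    }

    // Splits one CSV data line and trims stray whitespace around each field.
    static String[] parse(String line) {
        String[] fields = line.split(",");
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    // Steps 2 and 3: bucket the records by key and sum the amounts per bucket.
    // Returns lines in the question's format: id,name,total,type.
    static List<String> sumByCustomerAndType(List<String> dataLines) {
        Map<String, String[]> firstRecord = new LinkedHashMap<>();
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String line : dataLines) {
            String[] f = parse(line);
            String key = keyOf(f);
            firstRecord.putIfAbsent(key, f);  // remember id/name/type for output
            totals.merge(key, Integer.parseInt(f[2]), Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            String[] f = firstRecord.get(e.getKey());
            out.add(f[0] + "," + f[1] + "," + e.getValue() + "," + f[3]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Data rows from the question; the header row is assumed to be skipped.
        List<String> input = Arrays.asList(
            "cust123,ravi,100,D",
            "cust123,ravi,200,D",
            "cust234,Srini,200,C",
            "cust444,shaker,500,D",
            "cust123,ravi,100,C",
            "cust123,ravi,300,C");
        for (String line : sumByCustomerAndType(input)) {
            System.out.println(line);
        }
    }
}
```

Note that a PCollection is unordered, so in Beam the four result records may come out in any order; the sketch happens to preserve first-seen key order only because it uses a LinkedHashMap.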

The last two steps (GBK + ParDo) can probably be replaced by a Combine.perKey() PTransform, which does the same thing but can be optimized by the runtime. See the Beam documentation on Combine for more info.
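To see why a combine step leaves the runtime room to optimize, here is a rough plain-Java model of a summing combiner. The four method names mirror the ones in Beam's CombineFn contract (createAccumulator, addInput, mergeAccumulators, extractOutput), but the class itself (SumAmountsSketch) and the combineBundle helper are hypothetical stand-ins that do not extend any Beam type. Because addition is associative and commutative, each worker can pre-sum its own share of a key's values, and only the small partial sums need to be merged afterwards.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java model of a summing combiner; not a Beam class.
class SumAmountsSketch {

    // Start with an empty running total (a one-element array acts as a mutable box).
    static long[] createAccumulator() {
        return new long[] {0L};
    }

    // Fold one transaction amount into the running total.
    static long[] addInput(long[] acc, int amount) {
        acc[0] += amount;
        return acc;
    }

    // Merge partial totals that were computed independently.
    static long[] mergeAccumulators(List<long[]> accs) {
        long[] merged = createAccumulator();
        for (long[] acc : accs) {
            merged[0] += acc[0];
        }
        return merged;
    }

    // Turn the final accumulator into the output value.
    static long extractOutput(long[] acc) {
        return acc[0];
    }

    // Helper for the demo: pre-sum the amounts seen by one worker.
    static long[] combineBundle(List<Integer> amounts) {
        long[] acc = createAccumulator();
        for (int amount : amounts) {
            acc = addInput(acc, amount);
        }
        return acc;
    }

    public static void main(String[] args) {
        // Amounts for key "cust123,C", split across two imaginary workers.
        long[] workerA = combineBundle(Arrays.asList(100));
        long[] workerB = combineBundle(Arrays.asList(300));
        long total = extractOutput(mergeAccumulators(Arrays.asList(workerA, workerB)));
        System.out.println("cust123,C total = " + total);
    }
}
```

For simple integer sums over a PCollection of KV<String, Integer>, the Beam Java SDK already ships a ready-made transform, Sum.integersPerKey(), so a custom combiner is only needed when the aggregation is more involved.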

You can also look into Beam SQL, which would let you express the same logic in SQL; see the Beam SQL overview documentation. In that case you will need to add a ParDo that converts the CSV records to Beam Rows before applying the SqlTransform.
