I am looking to combine data in a PCollection.
The input is a CSV file:
customer id,customer name,transaction amount,transaction type
cust123,ravi,100,D
cust123,ravi,200,D
cust234,Srini,200,C
cust444,shaker,500,D
cust123,ravi,100,C
cust123,ravi,300,C
After reading the text file into a collection of objects, I want to combine the records so that the output is:
cust123,ravi,300,D
cust123,ravi,400,C
cust234,Srini,200,C
cust444,shaker,500,D
Pipeline pipeline = Pipeline.create(
    PipelineOptionsFactory.fromArgs(args).withValidation().create());
PCollection<Customer> pCollection =
    pipeline.apply("Read", TextIO.read().from("MyFile.csv"))
        .apply("splitData and store",
            ParDo.of(new TextTransform.SplitValues()));
If I understand it right, you need to sum the transaction amounts, grouping by customer id + transaction type. In that case, from a high-level perspective, you need to:
1. Assign a key to each record. You can use the WithKeys PTransform for that (see its documentation); the key can be, for example, csvField[0] + "," + csvField[3].
2. Group the records by that key using the GroupByKey PTransform (see its documentation).
3. Apply a ParDo that accepts each such group (all records belonging to the same customer and transaction type), sums up the amounts, and outputs one record with the sum.
The last two steps (GroupByKey + ParDo) can probably be replaced by a Combine.perKey() PTransform, which does the same thing but can be optimized by the runtime. See the Beam documentation on Combine for more info.
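To make the key-then-sum logic above concrete, here is a minimal sketch in plain Java. It deliberately uses only java.util collections, not the Beam API, so it shows what the keying and per-key sum compute rather than how a pipeline is wired. The class name TransactionTotals and the methods keyOf and sumByCustomerAndType are hypothetical names chosen for illustration. It also assumes the header row has already been removed and that amounts are whole numbers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Beam: illustrates the grouping logic only.
class TransactionTotals {

    // Builds the key suggested in the answer: customer id + "," + transaction type.
    static String keyOf(String[] csvField) {
        return csvField[0] + "," + csvField[3];
    }

    // Sums the amounts per key and returns lines laid out as "id,name,total,type".
    // Assumes every line is a data row (no header) with an integer amount.
    static List<String> sumByCustomerAndType(List<String> lines) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        Map<String, String> names = new LinkedHashMap<>();
        for (String line : lines) {
            String[] csvField = line.split(",");
            String key = keyOf(csvField);
            // merge() adds the amount to the running total for this key.
            totals.merge(key, Integer.parseInt(csvField[2].trim()), Integer::sum);
            // Remember the first customer name seen for this key.
            names.putIfAbsent(key, csvField[1]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            String[] idAndType = e.getKey().split(",");
            out.add(idAndType[0] + "," + names.get(e.getKey()) + ","
                    + e.getValue() + "," + idAndType[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "cust123,ravi,100,D",
                "cust123,ravi,200,D",
                "cust234,Srini,200,C",
                "cust444,shaker,500,D",
                "cust123,ravi,100,C",
                "cust123,ravi,300,C");
        sumByCustomerAndType(input).forEach(System.out::println);
    }
}
```

In a Beam pipeline, a function like keyOf is what the keying step would compute, and the running total kept by merge() corresponds to what a per-key sum combiner does. The order of the result lines here follows first appearance of each key, which may differ from the sample output; a PCollection does not guarantee ordering either.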
You can also look into Beam SQL, which would allow you to express the same logic in SQL. See the Beam SQL overview documentation. In this case you will need to add a ParDo that converts the CSV records to Beam Rows before applying the SqlTransform.