How to combine data in a PCollection - Apache Beam
I am looking to combine data in a PCollection.
The input is a CSV file:
customer id,customer name,transaction amount,transaction type
cust123,ravi,100,D
cust123,ravi,200,D
cust234,Srini,200,C
cust444,shaker,500,D
cust123,ravi,100,C
cust123,ravi,300,C
After reading the text file into a collection of objects, I want to combine the records to produce this output:
cust123,ravi,300,D
cust123,ravi,400,C
cust234,Srini,200,C
cust444,shaker,500,D
Pipeline pipeline = Pipeline.create(
    PipelineOptionsFactory.fromArgs(args).withValidation().create());
PCollection<Customer> pCollection =
    pipeline.apply("Read", TextIO.read().from("MyFile.csv"))
        .apply("splitData and store",
            ParDo.of(new TextTransform.SplitValues()));
If I understand correctly, you need to sum the transaction amounts, grouping by customer ID + transaction type. In that case, from a high-level perspective, you need to:
1. Assign a key to each record using the WithKeys PTransform (see the doc); the key would be something like csvField[0] + "," + csvField[3].
2. Group the records by the new key using the GroupByKey PTransform (see this doc).
3. Apply a ParDo that accepts each such group (all records belonging to the same customer and transaction type), sums up the amounts, and outputs the record with the sum.
The last two steps (GBK+ParDo) can probably be replaced by a Combine.perKey() PTransform, which does the same thing but can be optimized by the runtime. See this and this for more info.
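To make the expected result concrete, here is a minimal sketch of the keying and per-key summing logic in plain Java, with no Beam dependency. It is not a Beam pipeline: the class name CombineSketch and the helpers keyOf, amountOf and sumPerKey are hypothetical names chosen for illustration, and the sketch assumes the CSV header row has already been filtered out. In a real pipeline, the key function would be the one passed to WithKeys, and the summation would be done by GroupByKey plus a ParDo, or by Combine.perKey().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class CombineSketch {

    // Step 1 (the role of WithKeys in Beam): derive the grouping key
    // "customer id,transaction type" from one CSV line.
    static String keyOf(String csvLine) {
        String[] csvField = csvLine.split(",");
        return csvField[0] + "," + csvField[3];
    }

    // The value to be summed: the transaction amount in the third column.
    static int amountOf(String csvLine) {
        return Integer.parseInt(csvLine.split(",")[2].trim());
    }

    // Steps 2 and 3 (the role of GroupByKey + ParDo, or Combine.perKey):
    // group the lines by key and sum the amounts within each group.
    // A TreeMap is used only to get a stable, sorted iteration order.
    static Map<String, Integer> sumPerKey(List<String> csvLines) {
        return csvLines.stream()
                .collect(Collectors.groupingBy(
                        CombineSketch::keyOf,
                        TreeMap::new,
                        Collectors.summingInt(CombineSketch::amountOf)));
    }

    public static void main(String[] args) {
        // Sample rows from the question, without the header line.
        List<String> lines = Arrays.asList(
                "cust123,ravi,100,D",
                "cust123,ravi,200,D",
                "cust234,Srini,200,C",
                "cust444,shaker,500,D",
                "cust123,ravi,100,C",
                "cust123,ravi,300,C");

        sumPerKey(lines).forEach(
                (key, total) -> System.out.println(key + " -> " + total));
    }
}
```

The key here contains only the customer ID and the transaction type, as suggested above. If the customer name is also needed in the output, one option is to include it in the key, assuming a given customer ID always carries the same name.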
You can also look into Beam SQL, which would allow you to express the same logic in SQL. See this doc for a Beam SQL overview. In this case you will need to add a ParDo that converts the CSV records to Beam Rows before applying the SqlTransform.