
How to combine data in a PCollection - Apache Beam


I am looking to combine data in a PCollection.

The input is a CSV file:

customer id,customer name,transaction amount,transaction type  
cust123,ravi,100,D  
cust123,ravi,200,D  
cust234,Srini,200,C  
cust444,shaker,500,D  
cust123,ravi,100,C  
cust123,ravi,300,C  

After reading the text file into a collection of objects, I want to combine the records as shown below.

The output should be:

cust123,ravi,300,D  
cust123,ravi,400,C  
cust234,Srini,200,C  
cust444,shaker,500,D
Pipeline pipeline = Pipeline.create(
   PipelineOptionsFactory.fromArgs(args).withValidation().create());

PCollection<Customer> pCollection =
   pipeline.apply("Read", TextIO.read().from("MyFile.csv"))
           .apply("splitData and store",
               ParDo.of(new TextTransform.SplitValues()));

If I understand it right, you need to sum the transaction amounts grouped by customer id + transaction type. In that case, from a high-level perspective, you need to:

  • assign a key to each record:
    • you can use the WithKeys PTransform for that, see the doc;
    • the key is up to you; for example, you can combine the customer id with the transaction type, something like: csvField[0] + "," + csvField[3];
  • group the records by the new key using the GroupByKey PTransform, see this doc;
  • the output of the GBK will be collections of records sharing the same key, so you will need to apply a ParDo that accepts such a collection (all records belonging to the same customer and transaction type), sums up the amounts, and outputs a single record with the sum.
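To make the key-then-sum semantics concrete, here is a plain-Java sketch of what the GBK + ParDo pair computes (the CombineSketch class and its aggregate method are my own illustration, not Beam API; the key follows the csvField[0] + "," + csvField[3] suggestion above — in practice you could fold the customer name into the key as well, since it is functionally dependent on the customer id):

```java
import java.util.*;

// Plain-Java illustration of the per-key sum that GroupByKey + ParDo
// (or Combine.perKey) would perform inside the Beam pipeline.
class CombineSketch {

    // Key each CSV record by "customerId,transactionType" and sum the amounts.
    static Map<String, Integer> aggregate(List<String> csvLines) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            String key = f[0] + "," + f[3];      // customer id + transaction type
            int amount = Integer.parseInt(f[2]); // transaction amount
            sums.merge(key, amount, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "cust123,ravi,100,D", "cust123,ravi,200,D", "cust234,Srini,200,C",
            "cust444,shaker,500,D", "cust123,ravi,100,C", "cust123,ravi,300,C");
        // Prints:
        // cust123,D -> 300
        // cust234,C -> 200
        // cust444,D -> 500
        // cust123,C -> 400
        aggregate(input).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```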

The last two steps (GBK + ParDo) can probably be replaced by a Combine.perKey() PTransform, which does the same thing but can be optimized by the runtime. See this and this for more info.
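With the Beam Java SDK, the Combine.perKey approach might look like the following fragment. This is a sketch, not a tested pipeline: it assumes the SplitValues step emits raw CSV lines as a PCollection<String>, and the step names are made up. Here the key keeps the customer name as well, so the output record can be reassembled directly:

```java
// Sketch, assuming `lines` is a PCollection<String> of raw CSV records.
PCollection<KV<String, Integer>> sums = lines
    // Key each record by "customerId,customerName,transactionType";
    // the value is the transaction amount.
    .apply("KeyByCustomerAndType",
        MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via((String line) -> {
              String[] f = line.split(",");
              return KV.of(f[0] + "," + f[1] + "," + f[3],
                           Integer.parseInt(f[2]));
            }))
    // Sum.integersPerKey() is a Combine.perKey over an integer sum,
    // replacing the explicit GroupByKey + summing ParDo.
    .apply("SumPerKey", Sum.integersPerKey());
```

The resulting KV elements ("cust123,ravi,D" → 300, etc.) can then be formatted back into CSV lines with one more MapElements step.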

You can also look into Beam SQL, which would let you express the same logic in SQL. See this doc for a Beam SQL overview. In that case you will need to add a ParDo that converts the CSV records to Beam Rows before applying the SqlTransform.
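A Beam SQL version might look like this sketch (the Row schema and its field names are assumptions; PCOLLECTION is the implicit table name Beam SQL gives the input collection):

```java
// Sketch, assuming `rows` is a PCollection<Row> with schema
// (customer_id VARCHAR, customer_name VARCHAR, amount INT, type VARCHAR),
// produced by a ParDo that parses each CSV line into a Row.
PCollection<Row> result = rows.apply(
    SqlTransform.query(
        "SELECT customer_id, customer_name, SUM(amount) AS amount, type "
            + "FROM PCOLLECTION "
            + "GROUP BY customer_id, customer_name, type"));
```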
