
How to combine data in a PCollection - Apache Beam


I am looking to combine data in a PCollection.

The input is a CSV file:

customer id,customer name,transaction amount,transaction type  
cust123,ravi,100,D  
cust123,ravi,200,D  
cust234,Srini,200,C  
cust444,shaker,500,D  
cust123,ravi,100,C  
cust123,ravi,300,C  

After reading the text file into a collection of objects, I want to combine the records as shown below.

The output should be:

cust123,ravi,300,D  
cust123,ravi,400,C  
cust234,Srini,200,C  
cust444,shaker,500,D
Pipeline pipeline = Pipeline.create(
   PipelineOptionsFactory.fromArgs(args).withValidation().create());

PCollection<Customer> pCollection =
   pipeline.apply("Read", TextIO.read().from("MyFile.csv"))
           .apply("splitData and store",
               ParDo.of(new TextTransform.SplitValues()));

If I understand it right, you need to sum the transaction amounts grouped by customer id + transaction type. In that case, from a high-level perspective, you need to:

  • assign a key to each record:
    • you can use the WithKeys PTransform for that, see the doc;
    • the key is up to you; for example, you can combine the customer id with the transaction type, something like: csvField[0] + "," + csvField[3];
  • group the records by the new key using the GroupByKey PTransform, see this doc;
  • the output of the GBK will be collections of records sharing the same key, so you will need to apply a ParDo that accepts such a collection (all records belonging to the same customer and transaction type), sums up the amounts, and outputs a single record with the sum.
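To make the key-then-sum semantics concrete, here is a plain-Java sketch of what the GBK + ParDo pair computes (the CombineSketch class and its aggregate method are my own illustration, not Beam API; the key follows the csvField[0] + "," + csvField[3] suggestion above — in practice you could fold the customer name into the key as well, since it is functionally dependent on the customer id):

```java
import java.util.*;

// Plain-Java illustration of the per-key sum that GroupByKey + ParDo
// (or Combine.perKey) would perform inside the Beam pipeline.
class CombineSketch {

    // Key each CSV record by "customerId,transactionType" and sum the amounts.
    static Map<String, Integer> aggregate(List<String> csvLines) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            String key = f[0] + "," + f[3];      // customer id + transaction type
            int amount = Integer.parseInt(f[2]); // transaction amount
            sums.merge(key, amount, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "cust123,ravi,100,D", "cust123,ravi,200,D", "cust234,Srini,200,C",
            "cust444,shaker,500,D", "cust123,ravi,100,C", "cust123,ravi,300,C");
        // Prints:
        // cust123,D -> 300
        // cust234,C -> 200
        // cust444,D -> 500
        // cust123,C -> 400
        aggregate(input).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```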

The last two steps (GBK + ParDo) can probably be replaced by a Combine.perKey() PTransform, which does the same thing but can be optimized by the runtime. See this and this for more info.
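With the Beam Java SDK, the Combine.perKey approach might look like the following fragment. This is a sketch, not a tested pipeline: it assumes the SplitValues step emits raw CSV lines as a PCollection<String>, and the step names are made up. Here the key keeps the customer name as well, so the output record can be reassembled directly:

```java
// Sketch, assuming `lines` is a PCollection<String> of raw CSV records.
PCollection<KV<String, Integer>> sums = lines
    // Key each record by "customerId,customerName,transactionType";
    // the value is the transaction amount.
    .apply("KeyByCustomerAndType",
        MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via((String line) -> {
              String[] f = line.split(",");
              return KV.of(f[0] + "," + f[1] + "," + f[3],
                           Integer.parseInt(f[2]));
            }))
    // Sum.integersPerKey() is a Combine.perKey over an integer sum,
    // replacing the explicit GroupByKey + summing ParDo.
    .apply("SumPerKey", Sum.integersPerKey());
```

The resulting KV elements ("cust123,ravi,D" → 300, etc.) can then be formatted back into CSV lines with one more MapElements step.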

You can also look into Beam SQL, which would let you express the same logic in SQL. See this doc for a Beam SQL overview. In that case you will need to add a ParDo that converts the CSV records to Beam Rows before applying the SqlTransform.
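A Beam SQL version might look like this sketch (the Row schema and its field names are assumptions; PCOLLECTION is the implicit table name Beam SQL gives the input collection):

```java
// Sketch, assuming `rows` is a PCollection<Row> with schema
// (customer_id VARCHAR, customer_name VARCHAR, amount INT, type VARCHAR),
// produced by a ParDo that parses each CSV line into a Row.
PCollection<Row> result = rows.apply(
    SqlTransform.query(
        "SELECT customer_id, customer_name, SUM(amount) AS amount, type "
            + "FROM PCOLLECTION "
            + "GROUP BY customer_id, customer_name, type"));
```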
