
How to merge two PCollection KV<> by key?

I'm trying to output a SUM and a COUNT for the same key. E.g., given a .csv with millions of plane-delay events, I want to use Apache Beam (Java) to SUM the delay durations for each plane and COUNT how many delays each plane had.

Each row has plane_id, delay_duration, date, etc.

I'm trying to create two PCollections and then merge them before output.

// Sum of delay durations per plane.
PCollection<KV<String, Integer>> sum = eventInfo
    .apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
        .via((Event.EventInfo gInfo) -> KV.of(gInfo.getKey("plane_id"), gInfo.getDuration())))
    .apply(Sum.integersPerKey());

// Number of delay events per plane.
PCollection<KV<String, Long>> count = eventInfo
    .apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
        .via((Event.EventInfo gInfo) -> KV.of(gInfo.getKey("plane_id"), gInfo.getDuration())))
    .apply(Count.perKey());

These two PCollections work as expected, but I can't figure out how to output them (merge them?) as three columns: key | sum | count.

You will need CoGroupByKey (CoGBK), which will help you co-locate the sum and the count for the same key.
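To make the effect of the co-grouping concrete, here is a minimal plain-Java sketch of what it produces for each key. It deliberately does not use the Beam API: the class name CoGroupSketch, the method coGroup, and the sample plane IDs and values are all made up for illustration, and ordinary Maps stand in for the two PCollections. In an actual Beam pipeline the corresponding pieces would be a TupleTag for each input, KeyedPCollectionTuple.of(sumTag, sum).and(countTag, count), CoGroupByKey.create(), and CoGbkResult.getOnly(tag) inside a DoFn to read each side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustration only: mimics the per-key result of a co-group with plain Maps.
public class CoGroupSketch {

    // Joins the per-key sums with the per-key counts and formats each
    // key as one "key,sum,count" row.
    public static List<String> coGroup(Map<String, Integer> sums,
                                       Map<String, Long> counts) {
        // Union of keys from both inputs. Sorted here only to make the
        // output deterministic; Beam itself gives no ordering guarantee.
        TreeSet<String> keys = new TreeSet<>(sums.keySet());
        keys.addAll(counts.keySet());

        List<String> rows = new ArrayList<>();
        for (String key : keys) {
            // Assumption: a key missing on one side defaults to 0. When both
            // inputs derive from the same events, every key has both values.
            int sum = sums.getOrDefault(key, 0);
            long count = counts.getOrDefault(key, 0L);
            rows.add(key + "," + sum + "," + count);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the two PCollections.
        Map<String, Integer> sums = new HashMap<>();
        sums.put("plane_A", 75);
        sums.put("plane_B", 20);

        Map<String, Long> counts = new HashMap<>();
        counts.put("plane_A", 2L);
        counts.put("plane_B", 1L);

        // Each printed line has the three columns: key, sum, count.
        coGroup(sums, counts).forEach(System.out::println);
    }
}
```

In the Beam version, the formatting step in the loop body would live in the DoFn that follows CoGroupByKey, emitting one String per key that can then be written out with a text sink.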
