I'm trying to output a SUM and a COUNT for the same key. For example, given a .csv with millions of plane delay events, I want to use Apache Beam (Java) to SUM the delay durations for each plane and COUNT how many delays each plane had.
Each row has plane_id, delay_duration, date, etc.
I'm creating two PCollections and want to merge them before output.
// Total delay duration per plane.
PCollection<KV<String, Integer>> sum = eventInfo
    .apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
        .via((Event.EventInfo gInfo) -> KV.of(gInfo.getKey("plane_id"), gInfo.getDuration())))
    .apply(Sum.integersPerKey());
// Number of delay events per plane.
PCollection<KV<String, Long>> count = eventInfo
    .apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
        .via((Event.EventInfo gInfo) -> KV.of(gInfo.getKey("plane_id"), gInfo.getDuration())))
    .apply(Count.perKey());
These two PCollections work as expected, but I can't figure out how to output (merge?) them as 3 columns: key | sum | count.
You will need CoGroupByKey (CoGBK), which will help you co-locate the sum and the count for the same key.
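A minimal sketch of how the CoGroupByKey join might look, assuming the Beam Java SDK and a runner (for example the direct runner) are on the classpath. The class name `JoinSumAndCount`, the helper `formatRow`, the tag names, and the in-memory `Create.of(...)` input are all hypothetical stand-ins for illustration; in the real pipeline the `sum` and `count` collections from the question would be used instead.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;

public class JoinSumAndCount {
  // Hypothetical helper: formats one output row as "key,sum,count".
  static String formatRow(String planeId, int sum, long count) {
    return planeId + "," + sum + "," + count;
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Assumed stand-in for the (plane_id, delay_duration) pairs parsed from the CSV.
    PCollection<KV<String, Integer>> delays =
        p.apply(Create.of(KV.of("plane_a", 10), KV.of("plane_a", 20), KV.of("plane_b", 5)));

    // The two per-key aggregates from the question.
    PCollection<KV<String, Integer>> sum = delays.apply(Sum.integersPerKey());
    PCollection<KV<String, Long>> count = delays.apply(Count.<String, Integer>perKey());

    // One tag per input, so each value can be looked up in the joined result.
    final TupleTag<Integer> sumTag = new TupleTag<>();
    final TupleTag<Long> countTag = new TupleTag<>();

    // Group both collections by their shared key (plane_id).
    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(sumTag, sum)
            .and(countTag, count)
            .apply(CoGroupByKey.create());

    // Each key has exactly one sum and one count, so getOnly is safe here.
    PCollection<String> rows =
        joined.apply(
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, CoGbkResult> kv) ->
                    formatRow(
                        kv.getKey(),
                        kv.getValue().getOnly(sumTag),
                        kv.getValue().getOnly(countTag))));

    p.run().waitUntilFinish();
  }
}
```

The resulting `rows` collection holds one string per plane and could then be written out with a text sink such as `TextIO.write()`. `getOnly` throws if a tag has zero or several values for a key; that cannot happen here because both aggregates come from the same input and emit exactly one value per key.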