How to sum two Apache Spark JavaPairRDDs?

I have the following JavaPairRDDs which represent the number of orders for each customer:

JavaPairRDD<String, Integer> customersToOrderCountRDD1 = ...

JavaPairRDD<String, Integer> customersToOrderCountRDD2 = ...

where the first one is retrieved from a table in Cassandra and the second one is retrieved from an external Web API.

What is the most efficient way to combine the values of these two RDDs, that is, to get the total order count for each customer? For example, given the following data in the RDDs:

customersToOrderCountRDD1: (email1@email.com, 3) (email2@email.com, 4)
customersToOrderCountRDD2: (email1@email.com, 1) (email2@email.com, 2)

to get:

customersToTotalOrderCount: (email1@email.com, 4) (email2@email.com, 6)

Use union followed by reduceByKey. Both are described in the "Working with Key-Value Pairs" section of the Spark programming guide:
http://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs

JavaPairRDD<String, Integer> customersToTotalOrderCount =
        customersToOrderCountRDD1.union(customersToOrderCountRDD2).reduceByKey((a, b) -> a + b);
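To make clear what union followed by reduceByKey computes, here is a minimal plain-Java sketch that models the same semantics on in-memory lists. It is not Spark code: the class OrderCountMerge, the method sumByKey, and the extra customer email3@email.com are hypothetical names added for illustration, and Java 9+ is assumed for List.of and Map.entry.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OrderCountMerge {

    // Plain-Java model of rdd1.union(rdd2).reduceByKey((a, b) -> a + b).
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> first,
                                         List<Map.Entry<String, Integer>> second) {
        return Stream.concat(first.stream(), second.stream()) // "union": keep every pair from both inputs
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        Integer::sum,     // "reduceByKey": add up values that share a key
                        TreeMap::new));   // sorted map only so the printed order is stable
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> counts1 = List.of(
                Map.entry("email1@email.com", 3),
                Map.entry("email2@email.com", 4));
        List<Map.Entry<String, Integer>> counts2 = List.of(
                Map.entry("email1@email.com", 1),
                Map.entry("email2@email.com", 2),
                Map.entry("email3@email.com", 5)); // hypothetical customer present in one source only

        // prints {email1@email.com=4, email2@email.com=6, email3@email.com=5}
        System.out.println(sumByKey(counts1, counts2));
    }
}
```

Note that a customer who appears in only one of the two inputs still ends up in the result with their single count, whereas an inner join on the key would drop that customer.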
