Hadoop MapReduce - Sum and Sort by value

One of my friends was asked this question about Hadoop MapReduce: we have multiple stores, and each store has many customers visiting and buying things. The dataset consists of "Store#, Customer#, Quantity purchased". We need MapReduce code to get the top 2 customers for each store.

The solution I thought of was to do a secondary sort on qty in descending order (store + qty makes up the composite key) and, in the reducer, output just the first 2 values (customers) for each key (store + qty, where qty is part of the composite key). This works if each customer appears only once per store, but how do we handle a customer who has visited the same store multiple times?
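To make the secondary-sort idea concrete, here is a minimal sketch in plain Python that only imitates what the framework would do; it does not use any Hadoop API. The sample records, the function name `secondary_sort_top_n` and the choice to group on the store part of the key are assumptions made for illustration.

```python
from itertools import groupby, islice

# Hypothetical sample records: (store, customer, quantity)
records = [
    ("S1", "C1", 5),
    ("S1", "C2", 9),
    ("S1", "C3", 7),
    ("S2", "C1", 4),
    ("S2", "C4", 6),
]

def secondary_sort_top_n(records, n=2):
    """Imitate a secondary sort: order by (store, qty descending),
    group on store only, and keep the first n records per store."""
    # Stand-in for the shuffle sort on the composite key (store, qty desc).
    ordered = sorted(records, key=lambda r: (r[0], -r[2]))
    result = {}
    # Stand-in for a grouping comparator that looks at the store only.
    for store, group in groupby(ordered, key=lambda r: r[0]):
        result[store] = [(cust, qty) for _, cust, qty in islice(group, n)]
    return result

top = secondary_sort_top_n(records)
```

If the same customer appears several times for one store, this approach ranks individual purchases rather than customer totals. For example, with purchases of 5 and 6 by C1, 9 by C2 and 7 by C3 in the same store, it would pick C2 and C3 even though C1 bought 11 in total.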

One solution is to loop through the values, sum the qty for each customer, and sort by qty in the reducer. This would mean implementing the sort logic all over again, and I am not sure I can use a TreeMap/HashMap etc., since there might be memory constraints.
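A rough sketch of this single-reducer approach, again in plain Python rather than Hadoop code, might look like the following. The input line format, the sample data and the names `map_record`, `reduce_store` and `run` are assumptions for illustration. The dictionary plays the role of the HashMap from the question, so its size grows with the number of distinct customers per store, which is exactly the memory concern raised above.

```python
import heapq
from collections import defaultdict

def map_record(line):
    """Parse 'store,customer,qty' and emit (store, (customer, qty))."""
    store, customer, qty = [field.strip() for field in line.split(",")]
    return store, (customer, int(qty))

def reduce_store(values, n=2):
    """Sum qty per customer for one store, then keep the n largest totals."""
    totals = defaultdict(int)  # one entry per distinct customer of this store
    for customer, qty in values:
        totals[customer] += qty
    # nlargest keeps only n candidates at a time, so no full sort is needed.
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

def run(lines, n=2):
    """Stand-in for the framework: group mapper output by store, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        store, value = map_record(line)
        grouped[store].append(value)
    return {store: reduce_store(values, n) for store, values in grouped.items()}

# Hypothetical input in which C1 visits store S1 twice.
sample = ["S1,C1,5", "S1,C2,9", "S1,C1,6", "S1,C3,7", "S2,C4,6", "S2,C1,4"]
result = run(sample)
```

Selecting the top entries with a bounded heap avoids writing a full sort, but the per-customer totals still have to fit in the reducer's memory.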

Another solution is to write 2 MapReduce jobs that run one after the other: the first gets the sum of qty purchased for each customer and store, and the second sorts by qty and gets the top 2 buyers.
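The two-job variant can be sketched the same way. This is a plain Python simulation under assumed names (`job1_map`, `job1_reduce`, `job2_map`, `job2_reduce`, `shuffle`, `run_pipeline`) and assumed sample data, not actual Hadoop job code. The first job keys on (store, customer), so its reducer only has to keep a running sum; the second job re-keys on store and keeps just the top n totals.

```python
import heapq
from collections import defaultdict

def shuffle(pairs):
    """Stand-in for the framework shuffle: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def job1_map(record):
    store, customer, qty = record
    return (store, customer), qty  # composite key: store + customer

def job1_reduce(key, quantities):
    return key, sum(quantities)  # running sum only, no per-customer map

def job2_map(job1_output):
    (store, customer), total = job1_output
    return store, (customer, total)  # re-key on store alone

def job2_reduce(values, n=2):
    # Keeps at most n candidates while scanning the store's customer totals.
    return heapq.nlargest(n, values, key=lambda v: v[1])

def run_pipeline(records, n=2):
    summed = [job1_reduce(k, v) for k, v in shuffle(map(job1_map, records)).items()]
    by_store = shuffle(map(job2_map, summed))
    return {store: job2_reduce(values, n) for store, values in by_store.items()}

# Hypothetical input in which C1 visits store S1 twice.
records = [("S1", "C1", 5), ("S1", "C2", 9), ("S1", "C1", 6),
           ("S1", "C3", 7), ("S2", "C4", 6), ("S2", "C1", 4)]
top2 = run_pipeline(records)
```

The trade-off is an extra pass over the data in exchange for reducers that never hold more than a sum or n candidates at once.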

Is there any other way of achieving this, also taking memory constraints into account?

Try using a composite key of customer + store, then let the reducer and the MapReduce framework group and count them.
