
Hadoop MapReduce - Sum and Sort by value

Original post: 2015-06-16 10:06:28 · Tags: java, hadoop, data-structures, mapreduce

A friend of mine was asked this question about Hadoop MapReduce: we have multiple stores, and each store has many customers who visit and buy things. The dataset consists of "Store#, Customer#, Quantity purchased". We need MapReduce code that returns the top 2 customers for each store.

The solution I thought of was a secondary sort on qty in descending order, with store + qty forming the composite key. The reducer then just emits the first 2 values (customers) for each key (store + qty, where qty is part of the composite key). This works if each customer appears only once per store, but what do we do if a customer has visited the same store multiple times?
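The secondary sort described above can be modelled without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: `SecondarySortSketch`, `Purchase`, and `firstTwoPerStore` are hypothetical names, and the in-memory sort stands in for the shuffle. In a real job the same ordering would typically live in a `WritableComparable` composite key, combined with a partitioner and a grouping comparator that look only at the store.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the secondary sort; no Hadoop dependency (illustrative only).
class SecondarySortSketch {

    // One input row: "Store#, Customer#, Quantity purchased".
    static final class Purchase {
        final String store;
        final String customer;
        final int qty;

        Purchase(String store, String customer, int qty) {
            this.store = store;
            this.customer = customer;
            this.qty = qty;
        }
    }

    // Composite-key order: store ascending, then quantity descending.
    static final Comparator<Purchase> COMPOSITE_ORDER = (a, b) -> {
        int byStore = a.store.compareTo(b.store);
        return byStore != 0 ? byStore : Integer.compare(b.qty, a.qty);
    };

    // Emulates the reducer: rows arrive in composite-key order, grouped by
    // store, and only the first two customers of each group are kept.
    static Map<String, List<String>> firstTwoPerStore(List<Purchase> rows) {
        List<Purchase> sorted = new ArrayList<>(rows);
        sorted.sort(COMPOSITE_ORDER);
        Map<String, List<String>> top = new LinkedHashMap<>();
        for (Purchase p : sorted) {
            List<String> firstTwo = top.computeIfAbsent(p.store, s -> new ArrayList<>());
            if (firstTwo.size() < 2) {
                firstTwo.add(p.customer);
            }
        }
        return top;
    }
}
```

With the rows (S1, C1, 4), (S1, C1, 4), (S1, C2, 5), (S1, C3, 6) this returns C3 and C2 for S1, although C1 bought 8 in total. That is exactly the repeated-customer problem raised in the question: the sort sees individual purchases, not per-customer sums.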

One solution is to loop through the values, add up the qty for each customer, and then sort by qty in the reducer. This means I would be implementing the sort logic all over again, and I am not sure whether I can use a TreeMap/HashMap etc., since there might be memory constraints.
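As a rough illustration of the in-reducer approach, the sketch below sums per customer in a `HashMap` and then picks the two largest totals in a single pass, so no full sort is needed. It is plain Java under stated assumptions: `SumThenTopTwo` and `topTwo` are hypothetical names, and the input stands for all (customer, qty) pairs one reducer call receives for a single store. The map still holds one entry per distinct customer of that store, so the memory concern only goes away if that count is manageable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the reducer body for ONE store (illustrative only).
class SumThenTopTwo {

    // values: every (customer, quantity) pair seen for a single store.
    static List<Map.Entry<String, Long>> topTwo(Iterable<Map.Entry<String, Integer>> values) {
        // Step 1: add up the quantity per customer.
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Integer> v : values) {
            totals.merge(v.getKey(), v.getValue().longValue(), Long::sum);
        }

        // Step 2: one pass over the totals, tracking only the best two.
        Map.Entry<String, Long> first = null;
        Map.Entry<String, Long> second = null;
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            if (first == null || e.getValue() > first.getValue()) {
                second = first;
                first = e;
            } else if (second == null || e.getValue() > second.getValue()) {
                second = e;
            }
        }

        // Copy the winners so the result does not depend on the map.
        List<Map.Entry<String, Long>> result = new ArrayList<>();
        if (first != null) {
            result.add(new SimpleEntry<>(first));
        }
        if (second != null) {
            result.add(new SimpleEntry<>(second));
        }
        return result;
    }
}
```

The selection step needs only two references regardless of how many customers there are; it is the summing map, not the sorting, that dominates memory in this variant.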

Another solution is to write 2 MapReduce jobs that run one after the other. The first one gets the sum of qty purchased for each customer and store. The second one sorts by qty and gets the top 2 buyers.
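The two-job idea can be sketched as two plain-Java functions, one per pass. This is a model under assumptions, not Hadoop code: `TwoPassSketch` and its methods are hypothetical names, input lines are assumed to be comma-separated as "store,customer,qty", and a tab-joined string stands in for the (store, customer) key. Because the first pass is a plain sum, a real job could also use a combiner for it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of two chained passes (illustrative only).
class TwoPassSketch {

    // Pass 1: key = "store<TAB>customer", value = summed quantity.
    static Map<String, Long> sumPerStoreCustomer(List<String> lines) {
        Map<String, Long> sums = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            String key = f[0].trim() + "\t" + f[1].trim();
            sums.merge(key, Long.parseLong(f[2].trim()), Long::sum);
        }
        return sums;
    }

    // Pass 2: order by store, then by total descending, and keep the
    // first two customers per store. Each customer now appears once.
    static Map<String, List<String>> topTwoPerStore(Map<String, Long> sums) {
        List<Map.Entry<String, Long>> rows = new ArrayList<>(sums.entrySet());
        rows.sort((a, b) -> {
            String storeA = a.getKey().split("\t")[0];
            String storeB = b.getKey().split("\t")[0];
            int byStore = storeA.compareTo(storeB);
            return byStore != 0 ? byStore : Long.compare(b.getValue(), a.getValue());
        });
        Map<String, List<String>> top = new LinkedHashMap<>();
        for (Map.Entry<String, Long> row : rows) {
            String[] key = row.getKey().split("\t");
            List<String> best = top.computeIfAbsent(key[0], s -> new ArrayList<>());
            if (best.size() < 2) {
                best.add(key[1]);
            }
        }
        return top;
    }
}
```

Calling `topTwoPerStore(sumPerStoreCustomer(lines))` mirrors running the second job on the first job's output. Since pass 1 collapses repeat visits into one record per store and customer, pass 2 can rely on sorted order alone and needs no per-store map in the reducer.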

Is there any other way of achieving this, also taking memory constraints into account?