
Grouping joined data in Hadoop map-reduce

I have two different types of files. The first type is a list of users with the following structure: UserID,Name,CountryID

The second type is a list of orders: OrderID,UserID,OrderSum

Each user has many orders. I need to write a Hadoop map-reduce job (in Java) that produces output with the following structure: CountryID,NumOfUsers,MinOrder,MaxOrder

I have no problem writing two different mappers (one for each file type) and one reducer to join the data from both files by UserID, producing the following structure: UserID,CountryID,UsersMinOrder,UsersMaxOrder

But I don't understand how to group that data by CountryID.

I'd recommend running this through Pig or Hive, as you can then solve this kind of problem in just a few lines.

Failing that, I would run another MapReduce job on your joined data. In the mapper, for each input split, keep track of the min order, max order, and number of tuples (rows with a unique user ID) processed per country ID. There are only a few countries, so you can keep these stats in memory throughout the map task. At the end of the split, output the accumulated stats to the reducer, keyed by country ID. The reducer then simply combines the aggregated data from each split to find the global max, min, and count.
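
The second job described above can be sketched in plain Java, without the Hadoop API, to show the aggregation logic on its own. This is only an illustration under a few assumptions: the joined input has exactly one row per user in the form UserID,CountryID,UsersMinOrder,UsersMaxOrder (so counting rows counts users), and the names `CountryStatsSketch`, `Stats`, `mapSplit`, and `reduce` are hypothetical. In a real Hadoop job, the per-split map would most likely be filled in `Mapper.map()` and emitted from `Mapper.cleanup()`, and the merge step would live in the `Reducer`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CountryStatsSketch {

    /** Partial aggregate for one country: user count plus min and max order. */
    static final class Stats {
        long users;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;

        // Fold one joined row (assumed to be one distinct user) into the aggregate.
        void addUser(double userMin, double userMax) {
            users++;
            min = Math.min(min, userMin);
            max = Math.max(max, userMax);
        }

        // Fold another partial aggregate into this one (what the reducer does).
        void merge(Stats other) {
            users += other.users;
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
        }
    }

    // Simulated map phase for one input split: accumulate stats per country
    // in memory and return them once the whole split has been processed.
    static Map<String, Stats> mapSplit(List<String> lines) {
        Map<String, Stats> perCountry = new HashMap<>();
        for (String line : lines) {
            // Assumed layout: UserID,CountryID,UsersMinOrder,UsersMaxOrder
            String[] f = line.split(",");
            perCountry.computeIfAbsent(f[1].trim(), k -> new Stats())
                      .addUser(Double.parseDouble(f[2].trim()),
                               Double.parseDouble(f[3].trim()));
        }
        return perCountry;
    }

    // Simulated reduce phase: combine the partial stats of every split,
    // keyed by country ID, into global min, max, and user count.
    static Map<String, Stats> reduce(List<Map<String, Stats>> partials) {
        Map<String, Stats> global = new TreeMap<>();
        for (Map<String, Stats> partial : partials) {
            partial.forEach((country, s) ->
                global.computeIfAbsent(country, k -> new Stats()).merge(s));
        }
        return global;
    }

    public static void main(String[] args) {
        // Made-up sample data, split in two to mimic two input splits.
        List<String> split1 = Arrays.asList("u1,US,10.0,50.0", "u2,DE,5.0,20.0");
        List<String> split2 = Arrays.asList("u3,US,2.5,70.0", "u4,DE,8.0,15.0", "u5,US,30.0,40.0");

        Map<String, Stats> result =
            reduce(Arrays.asList(mapSplit(split1), mapSplit(split2)));

        // Output format: CountryID,NumOfUsers,MinOrder,MaxOrder
        result.forEach((country, s) ->
            System.out.println(country + "," + s.users + "," + s.min + "," + s.max));
    }
}
```

With the sample data above, this prints `DE,2,5.0,20.0` followed by `US,3,2.5,70.0`. Because min, max, and sum are associative, merging per-split partials gives the same result as processing every row in a single place, which is why the in-mapper accumulation is safe here.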
