How do I run two different mappers on the same input and have their output sent to a single reducer?
I have some flight data (each line containing origin, destination, flight number, etc.) and I need to process it to output flight details between all origins and destinations with one stopover. My idea is to have two mappers: one outputs the destination as key and the other outputs the origin as key, so the reducer gets the stopover location as key and all origins and destinations as an array of values. Then I can output the flight details with one stopover for all locations in the reducer.
So my question is: how do I run two different mappers on the same input file and have their output sent to one reducer?
I read about MultipleInputs.addInputPath, but I guess it needs the inputs to be different (or at least two copies of the same input).
I am thinking of running the two mapper jobs independently using a workflow, and then a third job with an identity mapper and a reducer where I will do the flight calculation. Is there a better solution than this? (Please do not ask me to use Hive; I am not comfortable with it yet.) Any guidance on implementing this using MapReduce would really help. Thanks.
Your question did not specify whether you wish to mix direct and one-stopover flights together, so I will go with the stated question: consider only exactly one (not zero) stopovers.
In that case, simply use two Map/Reduce stages. The first-stage mapper outputs (dest1, source1). The first-stage reducer receives (dest1, Array(source1, source2, ...)) and then writes its tuples to an HDFS output directory.
Now for the second stage: the mapper uses the stage-1 reducer output as its input directory. The second-stage mapper reads (dest1, Array(source1, source2, ...)) and outputs (dest2, (source1, dest1)). Your final (stage-2) reducer then receives (dest2, Array((source11, dest11), (source12, dest12), (source13, dest13), ...)) and writes that data to the HDFS output. You can then use any external tool you like to read those results from HDFS.
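The two-stage pipeline above can be sketched as a minimal Python simulation (plain dictionaries stand in for the shuffle; this is not actual Hadoop code). Note that the stage-2 mapper, as described, also needs access to the flight list as side data (e.g. via the distributed cache); here it is simply a global. The airport codes are made-up sample data.

```python
from collections import defaultdict

# Toy flight records: (origin, destination)
flights = [("JFK", "ORD"), ("BOS", "ORD"), ("ORD", "SFO"), ("ORD", "LAX")]

# --- Stage 1: mapper emits (dst, src); reducer groups sources by destination ---
stage1 = defaultdict(list)
for src, dst in flights:
    stage1[dst].append(src)        # reducer sees (dest1, [source1, source2, ...])

# --- Stage 2: mapper reads the stage-1 output and extends each group by one hop ---
stage2 = defaultdict(list)
for dest1, sources in stage1.items():
    for src2, dest2 in flights:
        if src2 == dest1:          # an onward flight dest1 -> dest2
            for source in sources:
                stage2[dest2].append((source, dest1))   # emit (dest2, (source, dest1))

# The stage-2 reducer receives (dest2, [(source, stopover), ...]) and writes it out.
for dest2 in sorted(stage2):
    print(dest2, sorted(stage2[dest2]))
```

Running this prints, for each final destination, every (origin, stopover) pair that reaches it in two hops, e.g. both JFK and BOS reach SFO via ORD.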
I think you can do it with just one mapper. The mapper emits each (src, dst, fno, ...) input record twice: once as (src, (src, dst, fno, ...)) and once as (dst, (src, dst, fno, ...)). In the reducer you need to figure out, for each record, whether its key is a source or a destination, and then do the stopover join. Using a flag to indicate the role of the key, together with a secondary sort, can make this a bit more efficient.
That way only a single MR job, with one mapper and one reducer, is necessary for the task.
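The emit-twice idea can be sketched as a small Python simulation (a dictionary stands in for the shuffle; no secondary sort, and not actual Hadoop code). The flag "S"/"D" marks whether the key was the record's source or destination, and the flight data is made up for illustration.

```python
from collections import defaultdict

# Toy flight records: (origin, destination, flight_number)
flights = [
    ("JFK", "ORD", "AA10"),
    ("BOS", "ORD", "AA44"),
    ("ORD", "SFO", "UA22"),
    ("ORD", "LAX", "DL33"),
]

def mapper(record):
    src, dst, fno = record
    # Emit each record twice: keyed by its origin and by its destination.
    yield (src, ("S", record))
    yield (dst, ("D", record))

# Simulate the shuffle: group all mapper outputs by key.
groups = defaultdict(list)
for record in flights:
    for key, value in mapper(record):
        groups[key].append(value)

def reducer(stopover, values):
    # Records flagged "D" arrive at this key; records flagged "S" depart from it.
    arriving = [r for flag, r in values if flag == "D"]
    departing = [r for flag, r in values if flag == "S"]
    for a in arriving:
        for d in departing:
            if a[0] != d[1]:   # skip round trips like JFK -> ORD -> JFK
                yield (a[0], stopover, d[1], a[2], d[2])

results = sorted(r for key in groups for r in reducer(key, groups[key]))
for r in results:
    print(r)   # (origin, stopover, destination, first_flight, second_flight)
```

With this data only ORD acts as a stopover, so the join produces the four JFK/BOS to SFO/LAX itineraries. In a real job, a secondary sort would let the reducer stream the "D" records before the "S" records instead of buffering both lists.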