How do I run two different mappers on the same input and have their output sent to a single reducer?

I have some flight data (each line contains origin, destination, flight number, etc.) and I need to process it to output flight details between all origins and destinations with one stopover. My idea is to have two mappers: one outputs the destination as the key and the other outputs the origin as the key, so the reducer gets the stopover location as the key and all origins and destinations as an array of values. Then I can output flight details with one stopover for all locations in the reducer.

So my question is: how do I run two different mappers on the same input file and have their output sent to one reducer?

I read about MultipleInputs.addInputPath, but I guess it needs the inputs to be different (or at least two copies of the same input).

I am thinking of running the two mapper jobs independently using a workflow, followed by a third job with an identity mapper and a reducer where I will do the flight calculation.

Is there a better solution than this? (Please do not ask me to use Hive; I am not comfortable with it yet.) Any guidance on implementing this using MapReduce would really help. Thanks.

Your question did not specify whether you wish to mix flights with and without stopovers.

So I will go ahead with the question as stated: consider only flights with exactly one (not zero) stopover.

In that case, simply have two Map/Reduce stages. The first-stage mapper outputs:

(dest1, source1)

The first-stage reducer receives:

(dest1, Array(source1, source2, ...))

The first-stage reducer then writes its tuples to an HDFS output directory.
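As a rough illustration of the first stage only, here is a plain-Python simulation of the map, shuffle, and reduce steps. It is not Hadoop API code; the sample records, the three-field record layout, and the function names (stage1_map, shuffle, stage1_reduce) are assumptions made up for this sketch.

```python
from collections import defaultdict

# Hypothetical sample records: (origin, destination, flight_number)
flights = [
    ("SFO", "ORD", "UA100"),
    ("LAX", "ORD", "AA200"),
    ("ORD", "JFK", "DL300"),
]


def stage1_map(record):
    """Emit (destination, origin) for one flight record."""
    origin, destination, _flight_no = record
    yield destination, origin


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def stage1_reduce(destination, origins):
    """Pass the grouped origins through; sorting only makes the output deterministic."""
    return destination, sorted(origins)


mapped = [pair for record in flights for pair in stage1_map(record)]
stage1_output = dict(stage1_reduce(k, v) for k, v in shuffle(mapped).items())
print(stage1_output)
```

In a real job, the framework performs the grouping itself and the reducer output would be written to the HDFS directory that the second stage reads.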

Now do the second stage: the mapper uses the stage 1 reducer output directory as its input.

The second-stage mapper reads:

(dest1, Array(source1, source2, ...))

The second-stage mapper outputs:

(dest2, (source1, dest1))

Then your final (stage 2) reducer receives:

(dest2, Array((source11, dest11), (source12, dest12), (source13, dest13), ...))

and it writes that data to the HDFS output. You can then use any external tools you like to read those results from HDFS.

I think you can do it with just one mapper.

The mapper emits each (src,dst,fno,...) input record twice, once as (src,(src,dst,fno,...)) and once as (dst,(src,dst,fno,...)). In the reducer, you need to figure out for each record whether its key is the source or the destination, and then do the stopover join. Using a flag to indicate the role of the key, together with a secondary sort, can make this a bit more efficient.

That way, only a single MR job with one mapper and one reducer is needed for the task.
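To make the single-job idea concrete, here is a minimal plain-Python simulation of it. It is not Hadoop API code. The sample records, the IN/OUT flag values, and the function names are assumptions invented for this sketch. Sorting the values by flag inside the reducer stands in for a real secondary sort, and skipping round trips that end at their starting airport is an extra assumption not stated in the thread.

```python
from collections import defaultdict

# Hypothetical sample records: (origin, destination, flight_number)
flights = [
    ("SFO", "ORD", "UA100"),
    ("LAX", "ORD", "AA200"),
    ("ORD", "JFK", "DL300"),
    ("ORD", "SFO", "UA400"),
]

IN, OUT = 0, 1  # role flags: flight arrives at the key / departs from the key


def map_flight(record):
    """Emit each record twice, keyed by both of its endpoints."""
    origin, destination, _flight_no = record
    yield destination, (IN, record)  # key is where this flight lands
    yield origin, (OUT, record)      # key is where this flight departs


def reduce_stopover(stopover, tagged_records):
    """Join arriving and departing flights at one stopover airport."""
    arrivals = []
    # Sorting by flag stands in for a secondary sort: all arrivals come
    # first, so only the arrivals have to be buffered.
    for flag, record in sorted(tagged_records):
        if flag == IN:
            arrivals.append(record)
            continue
        for first_leg in arrivals:
            # Assumption: skip round trips that end where they started.
            if first_leg[0] != record[1]:
                yield first_leg[0], stopover, record[1], first_leg[2], record[2]


def run_job(records):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_flight(record):
            grouped[key].append(value)
    results = []
    for key in sorted(grouped):
        results.extend(reduce_stopover(key, grouped[key]))
    return results


itineraries = run_job(flights)
for itinerary in itineraries:
    print(itinerary)
```

Each result tuple is (origin, stopover, final destination, first flight number, second flight number). In an actual Hadoop job the flag would typically be part of a composite key so that the framework's sort delivers arrivals before departures, instead of sorting inside the reducer as done here.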

