
Control number of hadoop mapper output files

I have a Hadoop job. When the job starts, some number of mappers are launched, and each mapper writes a file to disk, such as part-m-00000, part-m-00001. As I understand it, each mapper creates one part file. I have a large amount of data, so there must be more than one mapper, but can I somehow control the number of these output files? That is, if Hadoop starts, for example, 10 mappers, can there be only three part files?

I found this post: How do multiple reducers output only one part-file in Hadoop? But it uses an old version of the Hadoop library. I'm using classes from org.apache.hadoop.mapreduce.* and not from org.apache.hadoop.mapred.*.

I'm using Hadoop version 0.20, with hadoop-core:1.2.0.jar.

Is there any way to do this using the new Hadoop API?

The number of output files equals the number of reducers, or the number of mappers if there are no reducers.

You can configure your job with a single reducer so that the output of all the mappers is directed to it, and you get a single output file. Note that this will be less efficient: all the data (the mappers' output) will be sent over the wire (network I/O) to the node where the reducer runs, and since a single process eventually receives all the data, the job will probably run more slowly.
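A minimal driver sketch for this approach using the new mapreduce API (the class name and the identity-style job setup are illustrative; this assumes the Hadoop 0.20/1.x jars on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "single-output"); // Job.getInstance(conf) in later APIs
        job.setJarByClass(SingleOutputDriver.class);
        // No mapper/reducer classes set: the identity defaults pass records through.
        job.setNumReduceTasks(1); // funnel all mapper output into one reducer => one part-r-00000
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The key line is `job.setNumReduceTasks(1)`; everything else is boilerplate job setup.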

By the way, the fact that there are multiple part files shouldn't matter much, since you can pass the directory containing them to subsequent jobs.

I'm not sure you can do it (your link is about multiple outputs, not about converging to only one). And why use only one output? You will lose all parallelism in the sort.

I'm also working with big files (~10 GB each), and each of my MR jobs processes almost 100 GB. So to lower the number of mappers, I set a larger block size in HDFS (this applies only to newly written files) and a larger value of mapred.min.split.size in mapred-site.xml.
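The two settings described above would look something like this (property names are the 0.20/1.x ones; the 256 MB values are illustrative, not recommendations):

```xml
<!-- mapred-site.xml: raise the minimum split size so fewer map tasks are created -->
<property>
  <name>mapred.min.split.size</name>
  <value>268435456</value> <!-- 256 MB -->
</property>

<!-- hdfs-site.xml: larger block size, applied only to newly written files -->
<property>
  <name>dfs.block.size</name>
  <value>268435456</value> <!-- 256 MB -->
</property>
```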

You might want to look at MultipleOutputFormat

Part of what the Javadoc says:

This abstract class extends the FileOutputFormat, allowing to write the output data to different output files.

Both Mapper and Reducer can use this.

Check this link for how you can specify one or more output file names from different mappers when writing to HDFS.

NOTE: Also, make sure you don't use context.write(), so that 10 files from 10 mappers don't get created. Use only MultipleOutputFormat for output.
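Since the question uses the new org.apache.hadoop.mapreduce API, the analogous class there is MultipleOutputs (in org.apache.hadoop.mapreduce.lib.output). A mapper sketch of the pattern described above, assuming the named output "merged" has been registered in the driver (class and output names are hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class RoutingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Write via MultipleOutputs instead of context.write().
        // "merged" must be registered in the driver, e.g.:
        // MultipleOutputs.addNamedOutput(job, "merged",
        //     TextOutputFormat.class, Text.class, Text.class);
        mos.write("merged", new Text("key"), value);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // flush and close the named output files
    }
}
```

Note that this controls the names and routing of output files, not their total count; each map task still writes its own files.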

If the job has no reducers, partitioners, or combiners, each mapper writes one output file. At some point you will need to run a post-processing step to collect the outputs into one large file.
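One common post-processing step is the hadoop fs -getmerge shell command, which concatenates all part files in an HDFS directory into a single local file (the paths here are illustrative):

```shell
# Merge all part-m-* files under the job output directory into one local file
hadoop fs -getmerge /user/me/job-output merged-output.txt

# Optionally copy the merged result back into HDFS
hadoop fs -put merged-output.txt /user/me/merged-output.txt
```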
