By default, there can be two mappers for a job in Hadoop 2.7.3. I have a cluster of two systems with 4 cores available on each: one is the master and one is the worker. Now I want to run 3 map tasks on the worker node. Can I do it? I am using Hadoop streaming to run the job, so what arguments should I set for this purpose? I also want exactly one input line to go to each mapper. What should the arguments look like? My current command, which does not accomplish this, is
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-files test.py -mapper test.py -reducer cat \
-input /aws/input/sample.gz -output /aws/output/test
The output shows that only one map task runs.
The number of mappers running in parallel is based on the number of input splits and on available container resources.
Try NLineInputFormat so that each line of the input file goes to its own mapper:
-inputformat org.apache.hadoop.mapreduce.lib.input.NLineInputFormat
The number of lines per mapper can be set with the configuration parameter
-Dmapreduce.input.lineinputformat.linespermap=N
If you want the job to have exactly 3 mappers, set N = ceil(file_lines / 3) (note that file_lines / 3 + 1 overshoots when file_lines is a multiple of 3, yielding only 2 mappers).
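As a quick sanity check, here is a small shell sketch of that calculation; `sample.txt` and its contents are hypothetical stand-ins for your decompressed input file:

```shell
# Create a stand-in input file with 7 lines (hypothetical data).
printf 'a\nb\nc\nd\ne\nf\ng\n' > sample.txt

# Count the input lines.
FILE_LINES=$(wc -l < sample.txt)

# Ceiling division: lines per mapper so the file splits into 3 mappers.
N=$(( (FILE_LINES + 2) / 3 ))

echo "$N"   # 7 lines -> 3 lines per mapper -> mappers get 3, 3, and 1 lines
```

You would then pass this value as `-D mapreduce.input.lineinputformat.linespermap=$N`.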
If you also want them to run in parallel, make sure there are enough RAM and CPU resources to run 3 map tasks at once; this is usually configured via the map container memory settings in the YARN XML files. Remember that Hadoop also runs several auxiliary ecosystem processes, such as the NameNode, DataNode, ApplicationMaster, and ResourceManager, which consume resources too.
Also note that a gzip file is not splittable, so with a .gz input the whole file goes to a single mapper; you will likely need to decompress it to plain text for NLineInputFormat to take effect.
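Putting it together, a sketch of the adjusted streaming command might look like the following; the decompressed input path `/aws/input/sample.txt` is an assumption, and `linespermap=1` sends one line to each mapper as you asked (the `-D` generic options must come before the streaming-specific options):

```shell
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
  -D mapreduce.input.lineinputformat.linespermap=1 \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -inputformat org.apache.hadoop.mapreduce.lib.input.NLineInputFormat \
  -files test.py -mapper test.py -reducer cat \
  -input /aws/input/sample.txt -output /aws/output/test
```

This only controls how many map tasks are created; how many of them run concurrently still depends on the YARN container resources described above.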