
Change the number of mappers to the number of cores on each worker in Hadoop 2.7.3

By default, there can be two mappers for a job in Hadoop 2.7.3. I have a cluster of 2 systems with 4 cores available on each; one is the master and the other is a worker. I want to run 3 map tasks on the worker node. Can I do that? I am using Hadoop streaming to run the job, so which arguments should I set for this purpose? I also want exactly one input line to go to each mapper. What should the arguments look like? My current command, which does not accomplish this, is

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -files test.py   -mapper test.py    -reducer cat \
    -input /aws/input/sample.gz   -output /aws/output/test

The output shows that there is only one map task.

The number of mappers that run in parallel depends on the number of input splits and on the available container resources.

Try NLineInputFormat so that each line of the input file goes to its own mapper:

-inputformat org.apache.hadoop.mapreduce.lib.input.NLineInputFormat

The number of lines per mapper can be set with a configuration parameter:

-D mapreduce.input.lineinputformat.linespermap=N

If you want the job to have only 3 mappers, set N = file_lines / 3 + 1 (integer division, so that every line is covered).
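Putting the pieces together, the original command could be extended like this. This is a sketch, not a tested recipe: the paths come from the question, N is computed at run time, and it assumes the generic `-D` options are listed before the streaming options, as Hadoop streaming requires.

```shell
# Count the input lines (hadoop fs -text decompresses .gz on the fly),
# then derive lines-per-mapper so the job gets roughly 3 mappers.
LINES=$(hadoop fs -text /aws/input/sample.gz | wc -l)
N=$(( LINES / 3 + 1 ))

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -D mapreduce.input.lineinputformat.linespermap=$N \
    -inputformat org.apache.hadoop.mapreduce.lib.input.NLineInputFormat \
    -files test.py -mapper test.py -reducer cat \
    -input /aws/input/sample.gz -output /aws/output/test
```

Set N=1 instead of the computed value if the goal is strictly one line per mapper rather than three mappers total.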

If you also want them to run in parallel, make sure there is enough RAM and CPU to run 3 map tasks at once. This is usually configured in the YARN XML files by setting the map container memory. Remember that Hadoop also runs several auxiliary processes, such as the NameNode, DataNode, ApplicationMaster, and ResourceManager, which consume resources too.
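As a sketch of that configuration, the per-map-task memory and the node's total YARN memory might look like this. The 1536 MB and 6144 MB values are assumptions for a 4-core node, not recommendations; size them to your hardware.

```xml
<!-- mapred-site.xml: memory requested per map task (value is an assumption) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>

<!-- yarn-site.xml: total memory YARN may allocate on this node
     (must fit at least 3 map containers plus the AppMaster) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>6144</value>
</property>
```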

Also, note that a gzip-compressed file is not splittable in Hadoop, so you may need to use plain text as the input source for NLineInputFormat to split the work across mappers.
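One way to sidestep that is to decompress the file before loading it into HDFS. The paths below reuse the ones from the question; the `.txt` target name is an assumption.

```shell
# Decompress locally (-k keeps the original .gz), then upload
# the plain-text copy to HDFS so it can be split across mappers.
gunzip -k sample.gz
hadoop fs -put sample /aws/input/sample.txt
```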
