
How to increase the number of map tasks in Hadoop and how to get the total time taken by a Hadoop MapReduce job

I have a dataset which I'm trying to analyze in Hadoop. So far it runs smoothly on a small amount of data.

1st Query:
I want to test this on larger data and find out how long the job takes as the file size increases. How do I get the number of seconds it takes to complete the job? Is there any command-line syntax for that?

2nd Query:
dfs.replication is set to 1 in the hdfs-site.xml file. Does it only replicate the input data, or does it also have some effect on the MapReduce job?

3rd Query:
Right now I have a single-node Hadoop cluster. How do I know the exact number of mappers it produces for a given input file, and how can I change the number of mappers? I actually want to measure the time it takes to complete the job with different numbers of mappers.

For example: first I want to test the data with 10 mappers, then 20, and so on, so that I can see how long the job takes with different numbers of mappers.

3rd Query:

You can play around with the block size.

By default, if you don't configure it, the block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.

Suppose you have a 1 GB file and the block size is 64 MB. By default, if you haven't configured anything for the input split size, the split size equals the block size, so there will be 16 splits of 64 MB each for that 1 GB, and with one mapper per split, 16 mappers will be invoked for 1 GB of data.

If you change the block size to 128 MB, 8 mappers will be used; similarly, a 256 MB block size gives 4 mappers and a 512 MB block size gives 2 mappers.
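
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in plain Java (not Hadoop API; the sizes are just the ones from the example above):

long fileSize  = 1024L * 1024 * 1024;  // 1 GB input file
long splitSize = 64L * 1024 * 1024;    // = block size when nothing else is configured
long numMaps   = (fileSize + splitSize - 1) / splitSize;  // ceil(1 GB / 64 MB) = 16 map tasks
System.out.println(numMaps);           // 128 MB -> 8, 256 MB -> 4, 512 MB -> 2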

2nd Query: The replication factor can improve your MapReduce job's performance, because if the data is replicated properly the task tracker can run directly on a local copy of the block; otherwise it has to copy that block from another node, which uses network bandwidth and hence degrades performance.
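
If you want to experiment with that, the replication factor of an existing input file can also be raised from the Java API before running the job; a minimal sketch, where the path and factor are just placeholders (uses org.apache.hadoop.fs.FileSystem):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// bump the input file to 3 replicas so more nodes hold a local copy of each block
fs.setReplication(new Path("/user/hadoop/input/data.txt"), (short) 3);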

1st Query :

Once a job completes, it reports all the statistics at the end: how many mappers and reducers were used, how many bytes were read and written, how long it took to execute, and so on.
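
For example, a hedged sketch of reading those statistics through the Java API after the run (assumes job is the submitted org.apache.hadoop.mapreduce.Job and uses the Counters / CounterGroup / Counter classes):

job.waitForCompletion(true);
Counters counters = job.getCounters();
for (CounterGroup group : counters) {
    for (Counter counter : group) {
        // e.g. "Job Counters   Launched map tasks = 16"
        System.out.println(group.getDisplayName() + "\t"
            + counter.getDisplayName() + " = " + counter.getValue());
    }
}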

1st Query
I'm not sure about the command-line syntax, but you can use the Java API itself after job completion, e.g.:

job.waitForCompletion(false);
if (job.isSuccessful()) {
    // start/finish times are in milliseconds, so divide by 1000 for seconds
    System.out.println("completionTime: "
        + (job.getFinishTime() - job.getStartTime()) / 1000 + "s");
}

2nd Query
It will affect job performance, because the job can no longer take as much advantage of data locality as it would with a replication factor of 3. Data has to be transferred to task trackers where slots are available, resulting in more network I/O and degraded performance.

3rd Query
The number of mappers is always equal to the number of input splits. The orthodox way is to write a custom InputFormat which splits the data file based on the specified criteria. Say you have a 1 GB file and you want 5 mappers: just make the InputFormat produce splits of 200 MB (each of which will span more than 3 blocks at the default 64 MB block size).

On the other hand, you can use the default InputFormat and split the file manually into the number of pieces you want before submitting the job. The constraint here is that each sub-file must be no larger than the block size, so for 5 mappers you can use up to a total of 5*64 = 320 MB of file size.

The third way, changing the block size, can solve the issue without these troubles but is not advisable at all, because it requires a cluster restart each time.

UPDATE
The easiest, and probably the best, solution for the 3rd query is to use the mapred.max.split.size configuration on a per-job basis. To run 5 maps for a 1 GB file, do something like the following before job submission:

conf.set("mapred.max.split.size", "209715200"); // 200*1024^2 bytes        

Pretty simple, huh? There is also another property, mapred.min.split.size, though I'm still a bit confused about its use. This SE post may help you in that regard.
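
If you would rather go through the API than set the raw property name, FileInputFormat in the new mapreduce package has helpers for the same split-size settings; a minimal sketch (the 200 MB figure is just the example from above):

// org.apache.hadoop.mapreduce.lib.input.FileInputFormat
// effective split size is roughly max(minSplitSize, min(maxSplitSize, blockSize))
FileInputFormat.setMaxInputSplitSize(job, 209715200L); // 200 MB upper bound per split
FileInputFormat.setMinInputSplitSize(job, 1L);         // lower bound left at its default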

Alternatively, you can take advantage of the -D option when running the job, e.g.:

hadoop jar job.jar com.test.Main -Dmapred.max.split.size=209715200
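
Note that the -D generic option is only picked up if your driver runs through ToolRunner / GenericOptionsParser; a minimal sketch of such a driver (class and job names here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Main extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains whatever was passed with -D on the command line
        Job job = Job.getInstance(getConf(), "split-size-test");
        // ... set mapper, reducer, input and output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new Main(), args));
    }
}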

NB: These properties are deprecated in Hadoop 2.5.0. Have a look if you are using that version.

@namanamu,
Query 1:
If you are using a separate driver class, you can use a Java timer to find out how long the job takes: put your main code between long start = System.currentTimeMillis(); and long stop = System.currentTimeMillis();, and the time taken is (stop - start)/1000 seconds, as sketched below.
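
A minimal sketch of that timing pattern in the driver (the job setup and submission in between are assumed):

long start = System.currentTimeMillis();

// ... build the Job and run it, e.g. job.waitForCompletion(true); ...

long stop = System.currentTimeMillis();
System.out.println("Time taken: " + (stop - start) / 1000 + " seconds");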

Query 3: When you execute a job through the command line using hadoop jar myfile.jar, at the end you will find all the counters, such as the number of mappers and reducers, input groups, reduce groups, and all other info.
