
Hadoop 2.7: MapReduce task's total time using streaming API

I am running Hadoop 2.7.1 on a local cluster (all nodes running Ubuntu 14.x or above). My MapReduce programs are written in Python and I am using the streaming API to run them. I want to find the total time that all the MapReduce tasks across all the nodes are taking. How can I do that? I am not able to find the job files (perhaps they were removed from Hadoop 2.x onwards).

If you're looking for the aggregate time spent across all of your tasks, you'll want to look at the job counters. These can be viewed in the job history server by drilling into an individual job and clicking Counters on the left, or you can retrieve them programmatically with `mapred job` commands. For example, to print the summary status of every SUCCEEDED job:

mapred job -list all | grep SUCCEEDED | awk '{ print $1 }' | \
    xargs -n 1 mapred job -status

The closest thing to "aggregate wall time" that counts as consumed time on your cluster is "time spent in occupied slots", reported by the SLOTS_MILLIS_MAPS and SLOTS_MILLIS_REDUCES counters:

mapred job -list all | grep SUCCEEDED | awk '{ print $1 }' | \
    xargs -I {} mapred job -counter {} org.apache.hadoop.mapreduce.JobCounter SLOTS_MILLIS_MAPS
mapred job -list all | grep SUCCEEDED | awk '{ print $1 }' | \
    xargs -I {} mapred job -counter {} org.apache.hadoop.mapreduce.JobCounter SLOTS_MILLIS_REDUCES

The total time that all the MapReduce tasks take is the job's elapsed time. You can view it in the Hadoop web interface at http://ip_address:8088/ (click on the job in question).
