Hadoop 1.2.1 - multinode cluster - Reducer phase hangs for Wordcount program?

Question

My question may sound redundant here but the solution to the earlier questions were all ad-hoc. few I have tried but no luck yet.

Acutally, I am working on hadoop-1.2.1(on ubuntu 14), Initially I had single node set-up and there I ran the WordCount program succesfully. Then I added one more node to it according to this tutorial. It started successfully, without any errors, But now when I am running the same WordCount program it is hanging in reduce phase. I looked at task-tracker logs, they are as given below :-

INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201509110037_0001_m_000002_0 task's state:UNASSIGNED
INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201509110037_0001_m_000002_0 which needs 1 slots
INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201509110037_0001_m_000002_0 which needs 1 slots
INFO org.apache.hadoop.mapred.JobLocalizer: Initializing user hadoopuser on this TT.
INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201509110037_0001_m_18975496
INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201509110037_0001_m_18975496 spawned.
INFO org.apache.hadoop.mapred.TaskController: Writing commands to /app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoopuser/jobcache/job_201509110037_0001/attempt_201509110037_0001_m_000002_0/taskjvm.sh
INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201509110037_0001_m_18975496 given task: attempt_201509110037_0001_m_000002_0
INFO org.apache.hadoop.mapred.TaskTracker: attempt_201509110037_0001_m_000002_0 0.0% hdfs://HadoopMaster:54310/input/file02:25+3
INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201509110037_0001_m_000002_0 is done.
INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201509110037_0001_m_000002_0  was 6
INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201509110037_0001_m_18975496 exited with exit code 0. Number of tasks it ran: 1
INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201509110037_0001_r_000000_0 task's state:UNASSIGNED
INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201509110037_0001_r_000000_0 which needs 1 slots
INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201509110037_0001_r_000000_0 which needs 1 slots
INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hadoopuser for UID 10 from the native implementation
INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201509110037_0001_r_18975496
INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201509110037_0001_r_18975496 spawned.
INFO org.apache.hadoop.mapred.TaskController: Writing commands to /app/hadoop/tmp/mapred/local/ttprivate/taskTracker/hadoopuser/jobcache/job_201509110037_0001/attempt_201509110037_0001_r_000000_0/taskjvm.sh
INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201509110037_0001_r_18975496 given task: attempt_201509110037_0001_r_000000_0
INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 127.0.1.1:500, dest: 127.0.0.1:55946, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201509110037_0001_m_000002_0, duration: 7129894
INFO org.apache.hadoop.mapred.TaskTracker: attempt_201509110037_0001_r_000000_0 0.11111112% reduce > copy (1 of 3 at 0.00 MB/s) > 
INFO org.apache.hadoop.mapred.TaskTracker: attempt_201509110037_0001_r_000000_0 0.11111112% reduce > copy (1 of 3 at 0.00 MB/s) > 
INFO org.apache.hadoop.mapred.TaskTracker: attempt_201509110037_0001_r_000000_0 0.11111112% reduce > copy (1 of 3 at 0.00 MB/s) > 
INFO org.apache.hadoop.mapred.TaskTracker: attempt_201509110037_0001_r_000000_0 0.11111112% reduce > copy (1 of 3 at 0.00 MB/s) > 
INFO org.apache.hadoop.mapred.TaskTracker: attempt_201509110037_0001_r_000000_0 0.11111112% reduce > copy (1 of 3 at 0.00 MB/s) > 
INFO org.apache.hadoop.mapred.TaskTracker: attempt_201509110037_0001_r_000000_0 0.11111112% reduce > copy (1 of 3 at 0.00 MB/s) >

Also on the console where I am running the program It hangs at -

00:39:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
00:39:24 INFO util.NativeCodeLoader: Loaded the native-hadoop library
00:39:24 WARN snappy.LoadSnappy: Snappy native library not loaded
00:39:24 INFO mapred.FileInputFormat: Total input paths to process : 2
00:39:24 INFO mapred.JobClient: Running job: job_201509110037_0001
00:39:25 INFO mapred.JobClient:  map 0% reduce 0%
00:39:28 INFO mapred.JobClient:  map 100% reduce 0%
00:39:35 INFO mapred.JobClient:  map 100% reduce 11%

and my configuration files are as follows :-

//core-site.xml

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://HadoopMaster:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>

//hdfs-site.xml

<configuration>
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>   
</configuration>

//mapred-site.xml

<configuration>
<property>
  <name>mapred.job.tracker</name>
  <value>HadoopMaster:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
<property>
<name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>    
</configuration>

/etc/hosts

127.0.0.1 localhost
127.0.1.1 M-1947

#HADOOP CLUSTER SETUP
172.50.88.54 HadoopMaster
172.50.88.60 HadoopSlave1

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

/etc/hostname

M-1947

//masters

HadoopMaster

//slaves

HadoopMaster

HadoopSlave1

I have been struggling with it for long, any help is appreciated. Thanks !

Answer 1

Got it fixed.. although, the same issue has multiple questions on the forums but the verified solution according to me is that hostname resolution for the any node in the cluster should be correct (moreover this issue doesnot depend upon the size of cluster).

Actually it is the issue with dns-lookup , ensure one make the below changes to resolve the above issue -

try printing hostname on each machine using '$ hostname'
check that the hostname printed for each machine is same as the entry made in master/slaves file for respective machine.
If it doesn't matches then rename the host by making changes in the /etc/hostname file and reboot the system.

Example :-

in /etc/hosts file (let's say on Master machine of hadoop cluster)

127.0.0.1 localhost

127.0.1.1 john-machine

#Hadoop cluster

172.50.88.21 HadoopMaster

172.50.88.22 HadoopSlave1

172.50.88.23 HadoopSlave2

then it's -> /etc/hostname file (on master machine) should contain the following entry (for the above issue to be resolved)

HadoopMaster

similarly verify the /etc/hostname files of the each slave node.

Hadoop 1.2.1 - multinode cluster - Reducer phase hangs for Wordcount program?

Question

1 answers

solution1
1 2015-09-13 15:25:45

Hadoop 1.2.1 - multinode cluster - Reducer phase hangs for Wordcount program?

Question

1 answers

solution1 1 2015-09-13 15:25:45

solution1
1 2015-09-13 15:25:45