
Right way to package and deploy a Hadoop MapReduce job?

I run Hadoop 2.2.0.2.0.6.0-101 on a single local node running CentOS.

My MapReduce job compiles in Eclipse when I include the necessary jars from /usr/lib/hadoop and /usr/lib/hive as dependencies in the Eclipse project. Finding the necessary jars is a real quest! grep is my only tool for this job, with commands like grep -ri -l "FacebookService" /usr/lib/hadoop

Nevertheless, I get exceptions when I try to run my app on the same local node where I compiled it. I am about to give up trying to find the necessary jars: as soon as one exception is fixed, a new one appears.

Now, after fixing about 10 exceptions by adding jars from /usr/lib/hadoop and /usr/lib/hive, I got a really good one:

java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

The interesting part: when I add all the jars from these directories, my program runs!

But this last solution does not work in my case, as I need to create a self-sufficient package to run my app on another, distributed Hadoop installation.

What is the right way to deploy a Hadoop MapReduce job? How should I set the Hadoop CLASSPATH so that the MapReduce job runs on any node?

To reiterate what Vishal recommended: use Maven for dependency management. A typical pom.xml for a simple MapReduce project looks like this:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.test</groupId>
    <artifactId>hadoop.test</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.0.0-cdh4.2.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
</project>

That's the beauty of it: hadoop-client encapsulates all the Hadoop dependencies, and the provided scope keeps them out of your packaged artifact, because the cluster already supplies them at runtime.
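
Since you also need a self-sufficient package, a minimal sketch is to add the maven-shade-plugin inside the <project> element of the pom above, so that mvn package produces a single jar containing your classes plus every dependency that is not marked provided (for example your Hive client jars). The plugin version here is only an illustrative choice; adjust it to whatever is current for you:

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <!-- illustrative version, pick the one appropriate for your Maven -->
                <version>2.3</version>
                <executions>
                    <execution>
                        <!-- bind the shade goal to the package phase -->
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

Because hadoop-client is provided, the shaded jar stays reasonably small: it bundles only your own code and your extra libraries, while the Hadoop classes come from the cluster at runtime.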

Coming to the question of running the generated jar file:

You can have 2 scenarios:

  1. The machine you are trying to run on is part of the cluster, i.e. Hadoop is installed and configured. In this case the command "hadoop jar <>" already puts all Hadoop-related dependencies on the classpath; you only have to add your own dependent jars (see the sketch after this list).

  2. The machine does not have Hadoop installed. In this case you can use Maven to get the list of required jars by checking the effective POM.
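
A minimal sketch of scenario 1, assuming the pom above produced target/hadoop.test-0.0.1-SNAPSHOT.jar and assuming a hypothetical driver class com.test.WordCountDriver that runs through ToolRunner (so the generic -libjars option is parsed):

    # put extra client-side jars on the local classpath
    # (the Hive jar path/version is only an example, adjust to your installation)
    export HADOOP_CLASSPATH=/usr/lib/hive/lib/hive-exec.jar:$HADOOP_CLASSPATH

    # -libjars ships the listed jars to the cluster nodes for the map/reduce tasks
    hadoop jar target/hadoop.test-0.0.1-SNAPSHOT.jar com.test.WordCountDriver \
        -libjars /usr/lib/hive/lib/hive-exec.jar \
        /input/path /output/path

With a shaded jar (as sketched above) you usually do not need -libjars at all, because your dependencies are already packed inside the jar.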

Hope it is clear.
