
How do I install Hadoop and Pydoop on a fresh Ubuntu instance

Most of the setup instructions I see are verbose. Is there a near script-like set of commands that we can just execute to set up Hadoop and Pydoop on an Ubuntu instance on Amazon EC2?

Another solution would be to use Juju (Ubuntu's service orchestration framework).

First install the Juju client on your standard computer:

sudo add-apt-repository ppa:juju/stable
sudo apt-get update && sudo apt-get install juju-core

(instructions for MacOS and Windows are also available here)
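A quick sanity check that the client is installed (the version string printed will vary):

juju version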

Then generate a configuration file

juju generate-config

And modify it with your preferred cloud credentials (AWS, Azure, GCE, ...). Since the m3.medium instance type suggests AWS, I assume that's what you're using, so follow the AWS instructions.
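For illustration, a rough sketch of the AWS-related fields in the generated file (juju-core 1.x layout; the region and key values are placeholders to replace with your own):

# Edit the generated config and fill in your AWS credentials; the relevant
# fields look roughly like this (placeholders, not real values):
#
#   default: amazon
#   environments:
#     amazon:
#       type: ec2
#       region: us-east-1
#       access-key: YOUR_AWS_ACCESS_KEY_ID
#       secret-key: YOUR_AWS_SECRET_ACCESS_KEY
#
nano ~/.juju/environments.yaml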

Note: The above has to be done only once.

Now bootstrap

 juju bootstrap amazon

Optionally, deploy a GUI like the demo available on the Juju website:

juju deploy --to 0 juju-gui && juju expose juju-gui

You'll find the URL of the GUI and password with:

juju api-endpoints | cut -f1 -d":"
cat ~/.juju/environments/amazon.jenv | grep pass

Note that the above steps are preliminary to any Juju deployment and can be reused every time you want to spin up an environment.

Now comes your use case with Hadoop. You have several options.

  1. Just deploy 1 node of Hadoop

     juju deploy --constraints "cpu-cores=2 mem=4G root-disk=20G" hadoop 

You can track the deployment with

juju debug-log

and get info about the new instances with

juju status

This is the only command you'll need to deploy Hadoop (you can think of Juju as an evolution of apt for complex systems).

  2. Deploy a cluster of 3 nodes with HDFS and MapReduce

     juju deploy hadoop hadoop-master
     juju deploy hadoop hadoop-slavecluster
     juju add-unit -n 2 hadoop-slavecluster
     juju add-relation hadoop-master:namenode hadoop-slavecluster:datanode
     juju add-relation hadoop-master:resourcemanager hadoop-slavecluster:nodemanager
  3. Scale-out usage (separate HDFS & MapReduce, experimental)

     juju deploy hadoop hdfs-namenode
     juju deploy hadoop hdfs-datacluster
     juju add-unit -n 2 hdfs-datacluster
     juju add-relation hdfs-namenode:namenode hdfs-datacluster:datanode
     juju deploy hadoop mapred-resourcemanager
     juju deploy hadoop mapred-taskcluster
     juju add-unit -n 2 mapred-taskcluster
     juju add-relation mapred-resourcemanager:mapred-namenode hdfs-namenode:namenode
     juju add-relation mapred-taskcluster:mapred-namenode hdfs-namenode:namenode
     juju add-relation mapred-resourcemanager:resourcemanager mapred-taskcluster:nodemanager

For Pydoop, you'll have to install it manually as in the other answer below (you have access to the Juju instances via "juju ssh "), or you can write a "charm" (the way to teach Juju how to deploy Pydoop).
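A rough sketch of that manual route (the unit name is an assumption; check "juju status" for the real one, and see the other answer below for the full Pydoop build requirements):

juju ssh hadoop/0      # assumption: your Hadoop unit is hadoop/0
# On the unit, follow the Pydoop steps from the other answer below
# (build-essential, python-dev, pip, the HADOOP_HOME/JAVA_HOME exports,
# then "pip install pydoop"), pointing HADOOP_HOME at wherever the charm
# installed Hadoop.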

Create an Ubuntu instance. I set mine up as Ubuntu 14.04 on an m3.medium spot instance with a 20 GB data store (delete on termination) and all ports open (to be on the safe side).

SSH into the server and copy-paste the commands below, paragraph by paragraph.
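For example (the key file and host name are placeholders for your own instance; ubuntu is the default user on Ubuntu AMIs):

ssh -i my-key.pem ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com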

sudo apt-get -y update
sudo apt-get -y install default-jdk
ssh-keygen -t rsa -P ''                            # Press Enter when prompted

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# See http://www.apache.org/dyn/closer.cgi/hadoop/common/ for latest file version
wget http://download.nextag.com/apache/hadoop/common/current/hadoop-2.6.0.tar.gz
tar xfz hadoop-2.6.0.tar.gz

# Replace the folder/file names for your system
export HADOOP_PREFIX=/home/ubuntu/hadoop-2.6.0
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
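Optionally, append the same exports to ~/.bashrc so they survive a new login session:

cat >> ~/.bashrc <<EOF
export HADOOP_PREFIX=/home/ubuntu/hadoop-2.6.0
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
EOF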

Configure Hadoop

# Add these into the Hadoop env
cat >> $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh <<EOF
export JAVA_HOME=${JAVA_HOME}
export HADOOP_PREFIX=${HADOOP_PREFIX}
EOF

cat > $HADOOP_PREFIX/etc/hadoop/core-site.xml <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
EOF

cat > $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
EOF

Run a standalone node and add files to it

# Format and start HDFS
$HADOOP_PREFIX/bin/hdfs namenode -format
$HADOOP_PREFIX/sbin/start-dfs.sh
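# Optional sanity check (assumes jps from the JDK is on the PATH): it should
# list NameNode, DataNode and SecondaryNameNode once HDFS has started
jps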

# Create a folder
$HADOOP_PREFIX/bin/hdfs dfs -mkdir /user
$HADOOP_PREFIX/bin/hdfs dfs -mkdir /user/sample

# Copy input files into HDFS
$HADOOP_PREFIX/bin/hdfs dfs -put $HADOOP_PREFIX/etc/hadoop/*.xml /user/sample/

# Run example
$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep /user/sample /user/output 'dfs[a-z.]+'
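To inspect what the example job wrote (same output path as above):

$HADOOP_PREFIX/bin/hdfs dfs -cat /user/output/*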

Install Pydoop

sudo apt-get -y install build-essential python-dev python-pip
sudo bash    # Become root here so the exports below are visible to pip install (plain "sudo pip install" would drop them)
export HADOOP_HOME=/home/ubuntu/hadoop-2.6.0
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
pip install pydoop

Test Pydoop with this Python script:

import pydoop.hdfs

fs = pydoop.hdfs.hdfs()                    # Connect to the HDFS instance from core-site.xml
print(fs.list_directory('/user/sample'))   # Lists all files under /user/sample
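To run it, note that Pydoop also looks for the Hadoop installation at runtime (via HADOOP_HOME), so keep the variables exported in the shell you run it from; the script file name below is arbitrary:

export HADOOP_HOME=/home/ubuntu/hadoop-2.6.0
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
python test_pydoop.py    # the script above, saved as test_pydoop.py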
