Successful Oryx Install on Google Compute Engine?

Question

I am trying to get Oryx up and running on Google Compute Engine. I created a new instance and installed Oryx via:

git clone https://github.com/cloudera/oryx.git
cd oryx
mvn -DskipTests install

and saved this install as an image on Google Compute Engine ("oryx-image").

Because I ran into issues with Oryx on the Google file system (Hadoop 2.4.1 with the Google Cloud Storage connector for Hadoop), I have been using hdfs:// as the default file system.

The default Hadoop package launched on Google Compute Engine also gave me trouble (e.g., no Snappy libraries, which the default Oryx configuration needs), so I have also tried creating my own Hadoop 2.4.1 tarball with Snappy included, following these instructions: How to enable Snappy/Snappy Codec over hadoop cluster for Google Compute Engine (side note: is the JDK version described there sufficient for Oryx?). I then used my saved image with Oryx installed ("oryx-image"):

./bdutil --bucket <some-bucket> --image oryx-image -n $number \
    --env_var_files hadoop2_env.sh --default_fs hdfs

and my saved Hadoop tarball:

# File: hadoop2_env.sh
HADOOP_TARBALL_URI="gs://<some-bucket>/hadoop-2.4.1.tar.gz"

to deploy a Hadoop 2.4.1 (with Snappy) cluster (with default file system = hdfs://) on Google Compute Engine. Still no luck.

I can successfully run test Hadoop jobs on GCE, test the Snappy implementation on GCE (see the second link), and run test Oryx jobs locally on the GCE master node with:

# File: oryx.conf
model.local-data = true
model.local-computation = true

The only issue is getting Oryx to successfully run on Google Compute Engine with data in either hdfs:// or gs://.

I have found many conflicting instructions for environment variable changes and the like, and I don't know which ones are necessary and which ones may be causing more problems. Is there any documentation on installing and running Oryx on GCE? Perhaps someone has already gone through the same process and can offer instructions, or at least confirm a successful install?

The instructions (found in the second link) for installing Hadoop 2.4.1 with Snappy on GCE were superb. I was hoping to find something with that level of detail covering all the steps necessary to make Oryx work on GCE from scratch.

Thanks!

Answer 1

I don't know if this is a direct answer, but I can comment on a few points. I think most of the issues here are really about getting a standard Hadoop installation up and running on GCE.

I have never run Oryx on GCE, but it shouldn't directly matter whether it runs on bare metal, GCE, or EC2; it just uses Hadoop. It does assume Hadoop, though, and HDFS. (I think the hard-coded hdfs:// could be removed, sure; I don't know whether that would make it work with non-HDFS file systems.) So if GCE uses a different file system by default, then yes, your best bet is to use HDFS.
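One way to confirm which file system a cluster treats as its default is to read fs.defaultFS from the client configuration. The following is only a sketch: it assumes the hdfs command is on the PATH, and uri_scheme is a hypothetical helper, not part of Hadoop or Oryx.

```shell
# Hypothetical helper: print the scheme of a URI such as hdfs://host:8020/
# Prints an empty line when the argument has no scheme.
uri_scheme() {
  case "$1" in
    *://*) printf '%s\n' "${1%%://*}" ;;
    *) printf '\n' ;;
  esac
}

# Only query the cluster if the hdfs command is available.
if command -v hdfs >/dev/null 2>&1; then
  default_fs="$(hdfs getconf -confKey fs.defaultFS)"
  if [ "$(uri_scheme "$default_fs")" = "hdfs" ]; then
    echo "default file system is HDFS: $default_fs"
  else
    echo "default file system is not HDFS: $default_fs"
  fi
fi
```

On a cluster deployed with a non-HDFS default, this would report the other scheme (for example gs), which is a hint that Oryx's HDFS assumption may not hold.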

I suppose I think of Snappy as a required part of a Hadoop installation. If you're installing Hadoop by hand, yes I think you have to take a few more steps. This is why I'd recommend a (free, open source) distro that takes care of this for you.

It should also set up things like HADOOP_CONF_DIR for you, which, hm, I also tend to think of as a required part of a Hadoop setup in general, at least on the client side.
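As a quick sanity check on the client side, something like the sketch below can tell whether HADOOP_CONF_DIR points at a plausible configuration directory. The helper name is_hadoop_conf_dir is hypothetical, and treating the presence of core-site.xml as the test is an assumption, not an official rule.

```shell
# Hypothetical helper: succeed if the directory looks like a Hadoop client
# configuration directory, i.e. it is non-empty and contains core-site.xml.
is_hadoop_conf_dir() {
  [ -n "$1" ] && [ -f "$1/core-site.xml" ]
}

if is_hadoop_conf_dir "${HADOOP_CONF_DIR:-}"; then
  echo "HADOOP_CONF_DIR looks usable: $HADOOP_CONF_DIR"
else
  echo "HADOOP_CONF_DIR is unset or has no core-site.xml" >&2
fi
```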

Any version of Java 6 or later is fine.

Is it possible to try a distro? It may be much less painful. I'm sorry I don't have further instructions, but this seems more like a GCE<->Hadoop issue than a Hadoop<->Oryx one. If the app can be changed to accommodate GCE better, I can do that.

Answer 2

I found a not-so-elegant "solution" to this problem. The standard-issue Hadoop 2.4.1 provided by Google Compute Engine did actually have the Snappy libraries; they just weren't in the "right" place. So I copied all of the Snappy library files from their default location (/usr/lib/) to the Java library directories. Presumably only one of these lines is needed, but I haven't taken the time to discover which one:

sudo cp /usr/lib/lib* /usr/local/lib
sudo cp /usr/lib/lib* /usr/java/jdk1.7.0_55/lib/amd64/jli
sudo cp /usr/lib/lib* /usr/java/jdk1.7.0_55/lib/amd64
sudo cp /usr/lib/lib* /usr/java/jdk1.7.0_55/lib

Of course, this isn't so much a solution as a workaround. I suppose adding the Snappy library directory to the correct path would work too.
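A minimal sketch of that path-based alternative follows. It is untested on GCE and assumes the Snappy shared objects live in /usr/lib, as described above; prepend_path and SNAPPY_LIB_DIR are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: prepend a directory to a colon-separated path list
# unless it is already present, and print the result.
prepend_path() {
  dir="$1"
  current="$2"
  case ":$current:" in
    *":$dir:"*) printf '%s\n' "$current" ;;
    *) printf '%s\n' "${dir}${current:+:$current}" ;;
  esac
}

# Assumption: this is where the Snappy libraries were found on the image.
SNAPPY_LIB_DIR="${SNAPPY_LIB_DIR:-/usr/lib}"

# Let the dynamic linker find the libraries without copying them around.
LD_LIBRARY_PATH="$(prepend_path "$SNAPPY_LIB_DIR" "${LD_LIBRARY_PATH:-}")"
export LD_LIBRARY_PATH

# On a node with Hadoop installed, `hadoop checknative -a` reports whether
# the native Snappy library is now being picked up.
```

These lines would need to run in the environment that launches the Hadoop and Oryx processes (for example, from the shell profile or hadoop-env.sh) for the setting to take effect.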

solution1
2 2014-10-17 18:41:36

solution2
0 2014-10-29 20:38:20
