
Python packaging for Hive/Hadoop streaming

I have a Hive query with a custom mapper and reducer written in Python. The mapper and reducer modules depend on some third-party modules/packages that are not installed on my cluster (installing them on the cluster is not an option). I only realized this after running the Hive query, when it failed saying that the xyz module was not found.

How do I package the whole thing so that all the dependencies (including transitive dependencies) are available in my streaming job? And how do I import modules from such a package in my mapper and reducer?

The question may sound naive, but I could not find an answer even after an hour of googling. Also, it is not specific to Hive; it holds for Hadoop streaming jobs in general whenever the mapper/reducer is written in Python.

This can be done by packaging the dependencies together with the reducer script in a zip file, and adding that zip as a resource in Hive.

Let's say the Python reducer script depends on package D1, which in turn depends on D2 (this covers the OP's question about transitive dependencies), and neither D1 nor D2 is installed on any machine in the cluster.
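
Before walking through the steps, here is one way such an archive might be assembled. This is only a sketch, run on a build machine where pip is available; D1 is a placeholder name, and pip pulls in its transitive dependencies (D2 here) automatically:

    # build_dep.py -- assemble dep.zip on a machine that has pip.
    # D1 is a placeholder; "pip install D1" also vendors its transitive
    # dependencies (D2 in this example) into the target directory.
    import pathlib
    import shutil
    import subprocess
    import sys

    staging = pathlib.Path("dep")
    staging.mkdir(exist_ok=True)

    # Vendor the packages into the dep/ directory.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--target", str(staging), "D1"]
    )

    # Ship the reducer script alongside the vendored packages.
    shutil.copy("reducer.py", staging / "reducer.py")

    # Create dep.zip with a top-level dep/ directory, matching the paths
    # used in the Hive query below.
    shutil.make_archive("dep", "zip", root_dir=".", base_dir="dep")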

  • Package D1, D2, and the Python reducer script (let's call it reducer.py) in, say, dep.zip (the sketch above shows one way to do this)
  • Use this zip like in the following sample query:

    ADD ARCHIVE dep.zip;

    FROM some_table t1
    INSERT OVERWRITE TABLE t2
    REDUCE t1.col1, t1.col2 USING 'python dep.zip/dep/reducer.py' AS output;

Notice the first and the last lines. Hive unpacks the archive into a directory named after it (dep.zip here), and the dep directory inside it holds the script along with its dependencies, which is why the script is referenced as dep.zip/dep/reducer.py.
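
For completeness, here is a minimal sketch of what reducer.py itself might look like; the import of D1 is illustrative. Since the vendored packages sit next to the script inside the unpacked archive, prepending the script's own directory to sys.path makes them importable no matter what the working directory is (Python usually adds it anyway, so this is just a safeguard):

    #!/usr/bin/env python
    # reducer.py -- a sketch; D1 stands in for a real third-party package.
    import os
    import sys

    # The vendored packages live next to this script inside the unpacked
    # archive; make sure that directory is on the import path.
    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    import D1  # resolved from the vendored copy, not from the cluster

    def main():
        # Standard streaming contract: tab-separated key/value lines come in
        # on stdin, and results are written to stdout.
        for line in sys.stdin:
            key, _, value = line.rstrip("\n").partition("\t")
            # ... reduce logic using D1 goes here ...
            sys.stdout.write("%s\t%s\n" % (key, value))

    if __name__ == "__main__":
        main()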
