
Python packaging for Hive/Hadoop streaming

I have a Hive query with a custom mapper and reducer written in Python. The mapper and reducer modules depend on some third-party modules/packages that are not installed on my cluster (installing them on the cluster is not an option). I only realized this after running the Hive query, when it failed saying that the xyz module was not found.

How do I package the whole thing so that all the dependencies (including transitive dependencies) are available in my streaming job? And how do I import modules from such a package in my mapper and reducer?

The question may sound naive, but I could not find an answer even after an hour of googling. Also, it is not specific to Hive; it holds for Hadoop streaming jobs in general whenever the mapper/reducer is written in Python.

This can be done by packaging the dependencies together with the reducer script in a zip file, and adding that zip as a resource in Hive.

Let's say the Python reducer script depends on package D1, which in turn depends on D2 (this covers the OP's question about transitive dependencies), and neither D1 nor D2 is installed on any machine in the cluster.
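
Before walking through the steps, here is one way such an archive might be assembled. This is only a sketch, run on a build machine where pip is available; D1 is a placeholder name, and pip pulls in its transitive dependencies (D2 here) automatically:

    # build_dep.py -- assemble dep.zip on a machine that has pip.
    # D1 is a placeholder; "pip install D1" also vendors its transitive
    # dependencies (D2 in this example) into the target directory.
    import pathlib
    import shutil
    import subprocess
    import sys

    staging = pathlib.Path("dep")
    staging.mkdir(exist_ok=True)

    # Vendor the packages into the dep/ directory.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--target", str(staging), "D1"]
    )

    # Ship the reducer script alongside the vendored packages.
    shutil.copy("reducer.py", staging / "reducer.py")

    # Create dep.zip with a top-level dep/ directory, matching the paths
    # used in the Hive query below.
    shutil.make_archive("dep", "zip", root_dir=".", base_dir="dep")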

  • Package D1, D2, and the Python reducer script (let's call it reducer.py) in, say, dep.zip (the sketch above shows one way to do this)
  • Use this zip like in the following sample query:

    ADD ARCHIVE dep.zip;

    FROM some_table t1
    INSERT OVERWRITE TABLE t2
    REDUCE t1.col1, t1.col2 USING 'python dep.zip/dep/reducer.py' AS output;

Notice the first and the last lines. Hive unpacks the archive into a directory named after it (dep.zip here), and the dep directory inside it holds the script along with its dependencies, which is why the script is referenced as dep.zip/dep/reducer.py.
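
For completeness, here is a minimal sketch of what reducer.py itself might look like; the import of D1 is illustrative. Since the vendored packages sit next to the script inside the unpacked archive, prepending the script's own directory to sys.path makes them importable no matter what the working directory is (Python usually adds it anyway, so this is just a safeguard):

    #!/usr/bin/env python
    # reducer.py -- a sketch; D1 stands in for a real third-party package.
    import os
    import sys

    # The vendored packages live next to this script inside the unpacked
    # archive; make sure that directory is on the import path.
    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    import D1  # resolved from the vendored copy, not from the cluster

    def main():
        # Standard streaming contract: tab-separated key/value lines come in
        # on stdin, and results are written to stdout.
        for line in sys.stdin:
            key, _, value = line.rstrip("\n").partition("\t")
            # ... reduce logic using D1 goes here ...
            sys.stdout.write("%s\t%s\n" % (key, value))

    if __name__ == "__main__":
        main()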
