
Can I use mrjob python library on partitioned hive tables?

Asked 2014-09-17 11:57:23. Tags: python, hadoop, streaming, hive, mrjob

I have user access to a Hadoop server/cluster whose data is stored solely in partitioned Hive tables/files (Avro). Can I run MapReduce jobs on these tables using the Python mrjob library? So far I have been testing mrjob locally on text files stored on CDH5, and I am impressed by the ease of development.

After some research I discovered a library called HCatalog, but as far as I know it is not available for Python (only Java). Unfortunately, I do not have much time to learn Java and would like to stick to Python.

Do you know of any way to run mrjob on data stored in Hive?

If this is impossible, is there a way to stream MapReduce code written in Python to Hive? (I would rather not upload the Python MapReduce files to Hive.)

1 answer

As Alex stated, mrjob currently does not work with Avro-formatted files. However, there is a way to run Python code on Hive tables directly (no mrjob needed, unfortunately with some loss of flexibility). In the end, I managed to add a Python file as a resource to Hive by executing "ADD FILE mapper.py" and then running a SELECT clause with TRANSFORM ... USING ..., storing the mapper's results in a separate table. Example Hive query:

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

Full example is available here (at the bottom): link
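For illustration, a mapper script that such a query could call might look like the sketch below. It is not the script from the linked example. It assumes Hive's default TRANSFORM serialization (one row per line on stdin, columns separated by tabs, result rows written to stdout in the same form) and that unixtime holds seconds since the epoch. The function name transform_line is hypothetical, and computing the weekday in UTC is an assumption made here so that the result does not depend on the server's local time zone.

```python
#!/usr/bin/env python3
"""Sketch of a weekday_mapper.py for use with Hive's TRANSFORM ... USING."""
import sys
from datetime import datetime, timezone


def transform_line(line):
    """Turn one tab-separated input row into one tab-separated output row.

    Input columns:  userid, movieid, rating, unixtime
    Output columns: userid, movieid, rating, weekday (ISO: 1 = Monday ... 7 = Sunday)
    """
    userid, movieid, rating, unixtime = line.rstrip("\n").split("\t")
    # Assumption: unixtime is seconds since the epoch, interpreted in UTC.
    moment = datetime.fromtimestamp(float(unixtime), tz=timezone.utc)
    return "\t".join([userid, movieid, rating, str(moment.isoweekday())])


def main():
    # Hive streams the selected columns to stdin and reads rows back from stdout.
    for line in sys.stdin:
        if line.strip():  # skip empty lines
            print(transform_line(line))


if __name__ == "__main__":
    main()
```

Because the script only reads stdin and writes stdout, it can be checked locally before it is added to Hive, for example by piping a few tab-separated lines into "python weekday_mapper.py".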





solution1 0 2014-10-15 08:13:14
