Connecting HIVE in MRJob

Original post 2016-11-28 23:09:09 3 1 hadoop/ hive/ mrjob

The scenario: I need to process an input file and, for each record, check whether certain fields in it match fields stored in a Hadoop cluster.

We are thinking of using MRJob to process the input file and Hive to get the data from the Hadoop cluster. Is it possible to connect to Hive from inside an MRJob module? If so, how?

If not, what would be the ideal approach to fulfill my requirement?

I am new to Hadoop, MRJob and Hive.

Please provide some suggestions.

1 answer

"matching the fields stored in a Hadoop cluster." --> Do you mean that you need to check whether the fields exist in this file too?

Roughly how many files in total do you need to scan?

One solution is to load every item into an HBase table and, for every record in the input file, GET the record from the table. If the GET succeeds, the record exists elsewhere in HDFS; otherwise it doesn't. You would need a unique identifier for each HBase record, and the same identifier must also exist in your input file.
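The lookup described above can be sketched roughly as follows. This is only an illustration: `FakeTable` is a made-up in-memory stand-in for an HBase table, and the names `record_exists` and `check_input`, the comma separator and the position of the identifier are all assumptions. With a real cluster, a client object exposing a similar `row(key)` call (for example a happybase `Table`, which returns an empty dict for a missing row) could be passed in instead.

```python
from typing import Dict, Iterable, Iterator, Tuple


class FakeTable:
    """Hypothetical in-memory stand-in for an HBase table (not a real client)."""

    def __init__(self, rows: Dict[bytes, Dict[bytes, bytes]]):
        self._rows = dict(rows)

    def row(self, row_key: bytes) -> Dict[bytes, bytes]:
        # Mirrors the "GET": returns the row's cells, or {} when the key is absent.
        return self._rows.get(row_key, {})


def record_exists(table, row_key: bytes) -> bool:
    """A successful GET (non-empty row) means the record is already stored."""
    return bool(table.row(row_key))


def check_input(lines: Iterable[str], table, key_field: int = 0,
                sep: str = ",") -> Iterator[Tuple[str, bool]]:
    """Yield (record, exists) for every non-empty input line.

    Assumes the unique identifier sits in column `key_field` of a
    `sep`-separated line; adjust to the real input layout.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key = line.split(sep)[key_field].strip().encode("utf-8")
        yield line, record_exists(table, key)


if __name__ == "__main__":
    table = FakeTable({b"id-1": {b"cf:name": b"alice"}})
    for record, exists in check_input(["id-1,alice", "id-2,bob"], table):
        print(record, "found" if exists else "missing")
```

The table object is passed in rather than created inside the function, so the same check can be run against a test double or a real connection.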

You could also connect to Hive, but the schema would need to be rigid so that all your HDFS files can be loaded into a single Hive table. HBase doesn't really care about columns (only column families need to be defined). Another downside of MapReduce and Hive is that they are slow compared to HBase, which is near real time.
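If you do go the Hive route, one possible shape is to query the reference fields once and then match each input record against them in memory. The sketch below is an assumption, not a tested Hive setup: it uses the standard-library `sqlite3` module purely as a stand-in for a DB-API style Hive connection (a client such as PyHive would be needed for a real cluster), and the table name `customers`, its columns and both function names are hypothetical. In an mrjob job, the loading step would plausibly go in `mapper_init` and the matching in `mapper`, so the query runs once per mapper rather than once per record.

```python
import sqlite3
from typing import Iterable, Iterator, Set, Tuple


def load_reference_keys(cursor, query: str) -> Set[Tuple[str, ...]]:
    """Run one query and keep the selected fields as a set of tuples.

    `cursor` is any DB-API style cursor; here sqlite3 stands in for Hive.
    """
    cursor.execute(query)
    return {tuple(str(value) for value in row) for row in cursor.fetchall()}


def match_records(lines: Iterable[str], reference_keys: Set[Tuple[str, ...]],
                  field_indexes: Tuple[int, ...] = (0, 1),
                  sep: str = ",") -> Iterator[Tuple[str, bool]]:
    """Yield (record, matched) by comparing the chosen fields to the reference set."""
    for line in lines:
        line = line.rstrip("\n")
        fields = [field.strip() for field in line.split(sep)]
        if len(fields) <= max(field_indexes):
            continue  # skip malformed or short lines
        key = tuple(fields[i] for i in field_indexes)
        yield line, key in reference_keys


def demo():
    # Hypothetical reference table; with Hive this would already exist.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id TEXT, country TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [("1", "IN"), ("2", "US")])
    keys = load_reference_keys(conn.cursor(),
                               "SELECT id, country FROM customers")
    return list(match_records(["1,IN,foo", "2,DE,bar"], keys))


if __name__ == "__main__":
    for record, matched in demo():
        print(record, "match" if matched else "no match")
```

Holding the reference fields in memory only works while they fit in a mapper's memory; for larger data the per-record HBase GET described earlier avoids that limit.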

Hope this helps.

Can I use mrjob python library on partitioned hive tables?

Connecting Apache Superset with Hive

Connecting to Hive using Beeline

Connecting to Hive in R

Connecting Cassandra with Hive

Connecting to Hive Database with DBeaver

Beeline error connecting to Hive2

Connecting R to Hive on a Remote Server

Connecting to metastore in hive after upgrade

connecting no authentication apache hive with MicroStrategy

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

solution1 0 2016-11-29 00:18:12