
Connecting to Hive from MRJob

The scenario: I need to process an input file and, for each record, check whether certain fields in it match fields stored in a Hadoop cluster.

We are thinking of using MRJob to process the input file and Hive to fetch the data from the Hadoop cluster. Is it possible to connect to Hive from inside an MRJob module? If so, how?

If not, what would be the ideal approach to meet this requirement?

I am new to Hadoop, MRJob and Hive.

Please provide some suggestions.

"matching the fields stored in an Hadoop cluster." --> Do you mean that you need to check whether the same fields also exist in the data on the cluster?

Roughly how many files in total do you need to scan?

One solution is to load every item into an HBase table and, for each record in the input file, issue a GET for that record against the table. If the GET succeeds, the record exists elsewhere in HDFS; otherwise it does not. You would need a unique identifier for each HBase record, and the same identifier must also be present in your input file.
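
The per-record GET described above could be sketched roughly as follows. This is a minimal illustration, not a tested integration: the third-party `happybase` client, the host and table names, and the assumption that the first comma-separated field of each input line is the row key are all hypothetical choices. The matching logic accepts any lookup function, so it can be tried against a plain dict before pointing it at a cluster.

```python
def make_hbase_lookup(host, table_name):
    """Return a function mapping a row key to the stored row (empty dict if absent).

    Assumes the third-party happybase client and an HBase Thrift server on `host`.
    """
    import happybase  # imported lazily so the rest of the sketch runs without it

    table = happybase.Connection(host).table(table_name)
    return lambda key: table.row(key.encode("utf-8"))


def match_records(lines, lookup, key_index=0, sep=","):
    """Yield (row_key, found) for each non-empty input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key = line.split(sep)[key_index]
        # An empty or missing result means no row with that key exists.
        yield key, bool(lookup(key))


if __name__ == "__main__":
    # Stand-in for HBase: a dict keyed by the same identifiers.
    fake_table = {"id1": {b"cf:name": b"alice"}}
    for key, found in match_records(["id1,alice", "id2,bob"], fake_table.get):
        print(key, found)
```

Inside an MRJob, one option would be to create the lookup once per task in `mapper_init` and call it from `mapper`, so that a connection is not opened for every record.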

You could also connect to Hive, but the schema would need to be rigid enough for all your HDFS files to be loaded into a single Hive table. HBase doesn't really care about columns (only column families need to be defined up front). Another downside of MapReduce and Hive is that they are slow compared to HBase, which is near real time.
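
If you do go the Hive route, one way to avoid paying Hive's query latency for every record is to export the key column once and check membership locally. The sketch below assumes the keys were dumped beforehand to a tab-separated file, for example with something like `hive -e "SELECT id FROM reference_table" > keys.tsv`. The table, column, and file names are hypothetical, and the approach only works if the key set fits in memory.

```python
import io


def load_keys(fileobj):
    """Read one key per line (first tab-separated column) into a set."""
    keys = set()
    for line in fileobj:
        line = line.rstrip("\n")
        if line:
            keys.add(line.split("\t")[0])
    return keys


def split_matches(lines, keys, key_index=0, sep=","):
    """Partition input lines into (matched, unmatched) by key membership."""
    matched, unmatched = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key = line.split(sep)[key_index]
        (matched if key in keys else unmatched).append(line)
    return matched, unmatched


if __name__ == "__main__":
    # io.StringIO stands in for the exported keys file.
    exported = io.StringIO("id1\nid3\n")
    keys = load_keys(exported)
    print(split_matches(["id1,alice", "id2,bob"], keys))
```

If the reference data is too large to hold in memory, loading the input file into Hive as its own table and joining there is probably the better fit.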

Hope this helps.
