Hadoop Hive - best use cases for creating custom Hive input and output formats?

I would like to understand the best use cases for creating a custom Hive InputFormat and OutputFormat.

If any of you have created one, could you please let me know how you decided when to develop a custom input/output format?

Thanks,

To make Hive varchar behave like Oracle varchar2:

While working on an Oracle-to-Hadoop migration, we came across an Oracle behavior: if the length of the data for a varchar2 column exceeds the length defined in the table DDL, Oracle rejects the record.

Ex: Let's say we have a column 'name' in Oracle and in Hadoop with a maximum length of 10 (bytes in Oracle, characters in Hive)

name varchar2(10 BYTE) - Oracle

name varchar(10) - Hive

If the value of the name field is "lengthgreaterthanten", Oracle rejects the record, because Oracle applies the schema at write time. Hive, on the other hand, reads "lengthgrea", i.e. the first 10 characters, because Hive only applies the schema when it reads the records from HDFS.

To get around this problem we came up with a custom input format that splits each record on the delimiter and checks the length of the varchar field. If the length is greater than the specified length, the record is skipped and processing continues with the next record. Otherwise (the length is less than or equal to the specified length), the record is written to HDFS.
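The length check described above can be sketched roughly as follows. This is only an illustration of the filtering logic in Python, not the actual input format from the migration (a real Hive input format would be a Java class built on Hadoop's InputFormat API). The function name `filter_oversized`, its parameters, the UTF-8 byte counting, and the choice to skip rows that lack the column are all assumptions made for this sketch.

```python
from typing import Iterable, Iterator


def filter_oversized(
    lines: Iterable[str],
    delimiter: str,
    column_index: int,
    max_length: int,
) -> Iterator[str]:
    """Yield only the records whose checked field fits within max_length.

    Hypothetical helper: mimics the "reject the record" behavior by
    skipping rows instead of letting them be truncated at read time.
    """
    for line in lines:
        record = line.rstrip("\n")
        fields = record.split(delimiter)
        if column_index >= len(fields):
            # Assumption: a row without the column is treated as invalid.
            continue
        # Assumption: length is measured in UTF-8 bytes, like varchar2(10 BYTE).
        if len(fields[column_index].encode("utf-8")) > max_length:
            continue  # too long: move on to the next record
        yield record


if __name__ == "__main__":
    rows = ["1,short", "2,lengthgreaterthanten", "3,exactlyten"]
    for kept in filter_oversized(rows, ",", 1, 10):
        print(kept)
```

With the sample rows, the record containing "lengthgreaterthanten" is dropped and the other two are kept, which is the behavior we wanted instead of the silent truncation to "lengthgrea".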

Hope this helps. Thanks

RCFile, Parquet and ORC are among the file formats used with Hive. These are columnar file formats. The advantage is that when you read large tables you don't have to read and process all the data: most aggregation queries refer to only a few columns rather than all of them, so this can speed up processing considerably.
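To make the columnar advantage concrete, here is a toy sketch in Python. It does not use RCFile, Parquet or ORC at all; it only counts how many stored values an aggregation has to touch when the same data is laid out by row versus by column. The sample data and all function names are made up for the illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical sample table: three columns, three rows.
sample_rows: List[dict] = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
]


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot row-oriented records into one list of values per column."""
    columns: Dict[str, list] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def sum_row_layout(rows: List[dict], column: str) -> Tuple[float, int]:
    """Sum one column from a row layout; every field of every row is read."""
    total = 0.0
    touched = 0
    for row in rows:
        touched += len(row)  # the whole row has to be read to reach one field
        total += row[column]
    return total, touched


def sum_column_layout(columns: Dict[str, list], column: str) -> Tuple[float, int]:
    """Sum one column from a column layout; only that column is read."""
    values = columns[column]
    return sum(values), len(values)


if __name__ == "__main__":
    print(sum_row_layout(sample_rows, "amount"))
    print(sum_column_layout(to_columns(sample_rows), "amount"))
```

Both functions return the same sum, but the row layout touches every value in the table while the column layout touches only the values of the one column being aggregated. On a wide table with many columns the gap grows accordingly.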

Another application could be storing, reading and processing data in your own custom format, where the data is structured differently from CSV. These might be binary files or any other structure.
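As a rough idea of what reading such non-CSV data involves, the sketch below parses fixed-width binary records in Python. The record layout (a 4-byte little-endian integer followed by a 10-byte, null-padded name) and the function name `read_records` are purely hypothetical; a custom Hive input format would do the equivalent work in Java inside its record reader.

```python
import struct
from typing import Iterator, Tuple

# Hypothetical layout: 4-byte little-endian int id + 10-byte null-padded name.
RECORD = struct.Struct("<i10s")


def read_records(data: bytes) -> Iterator[Tuple[int, str]]:
    """Decode consecutive fixed-width records from a bytes buffer."""
    # Assumption: a trailing partial record is ignored rather than raising.
    usable = len(data) - len(data) % RECORD.size
    for record_id, raw_name in RECORD.iter_unpack(data[:usable]):
        yield record_id, raw_name.rstrip(b"\x00").decode("utf-8")


if __name__ == "__main__":
    blob = RECORD.pack(1, b"alice") + RECORD.pack(2, b"bob")
    for record in read_records(blob):
        print(record)
```

The point is only that the reader, not a delimiter, decides where one record ends and the next begins, which is the kind of logic a custom input format lets you plug in.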

You will have to follow the documentation to create input formats. For details you can follow the link: Custom InputFormat with Hive
