

Hadoop Hive - best use cases to create custom Hive Input and Output formats?

I just wanted to understand: what are the best use cases for creating a custom Hive InputFormat and OutputFormat?

If any of you have created one, could you please let me know when you decided to develop a custom Input/Output format?

Thanks,

To make Hive varchar behave like Oracle varchar2:

While working on an Oracle-to-Hadoop migration, we came across a behavior in Oracle: if the length of the data for a varchar2 column exceeds the length defined in the table DDL, Oracle rejects the record.

Ex: Let's say we have a column 'name' in both Oracle and Hive with a maximum length of 10 bytes:

name varchar2(10 BYTE) - Oracle

name varchar(10) - Hive

If the value of the name field is "lengthgreaterthanten", Oracle rejects the record, because Oracle applies the schema at write time. Hive, on the other hand, reads "lengthgrea", i.e. the first 10 characters, because Hive only applies the schema when reading the records from HDFS.

To get over this problem, we came up with a custom input format that checks the length of the varchar field by splitting the record on the delimiter. If the length is greater than the specified length, it skips that record and continues to the next one. Else, if the length is less than or equal to the specified length, the record is written to HDFS.
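For illustration only (this is not the original author's code), a minimal sketch of such an input format could look like the class below, written against the old org.apache.hadoop.mapred API that Hive table input formats use. The class name and the configuration keys for the column index, maximum length, and field delimiter are made-up placeholders, and a single delimited text line per record is assumed.

// Minimal sketch, not production code: wraps LineRecordReader and drops any
// record whose delimited field exceeds the declared varchar length, so Hive
// never sees (and never silently truncates) it.
package com.example.hive;

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class VarcharLengthInputFormat extends TextInputFormat {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {

    // Hypothetical configuration keys; in practice these could also come
    // from table properties.
    final int colIndex = job.getInt("varchar.check.col.index", 0);
    final int maxLen   = job.getInt("varchar.check.max.length", 10);
    final Pattern delim =
        Pattern.compile(Pattern.quote(job.get("varchar.check.delim", ",")));

    final LineRecordReader lines = new LineRecordReader(job, (FileSplit) split);

    return new RecordReader<LongWritable, Text>() {
      @Override
      public boolean next(LongWritable key, Text value) throws IOException {
        // Keep reading until a record passes the length check or the split ends.
        while (lines.next(key, value)) {
          String[] fields = delim.split(value.toString(), -1);
          if (fields.length > colIndex && fields[colIndex].length() <= maxLen) {
            return true; // valid record, hand it to Hive
          }
          // field too long (or missing): skip it, mimicking Oracle's reject
        }
        return false;
      }

      @Override public LongWritable createKey() { return lines.createKey(); }
      @Override public Text createValue() { return lines.createValue(); }
      @Override public long getPos() throws IOException { return lines.getPos(); }
      @Override public float getProgress() throws IOException { return lines.getProgress(); }
      @Override public void close() throws IOException { lines.close(); }
    };
  }
}

Once a class like this is on the Hive classpath, a table can point at it with STORED AS INPUTFORMAT (see the sketch near the end of this page).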

Hope this helps. Thanks

Some of the various file formats used with Hive are RCFile, Parquet, and ORC. These are columnar file formats. This gives the advantage that when you read large tables you don't have to read and process all of the data. Most aggregation queries refer to only a few columns rather than all of them, which speeds up processing hugely.
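As a rough sketch of that idea (the table, columns, and connection URL below are made up for illustration), an ORC-backed table lets an aggregation that touches only two columns skip the rest of the column data entirely:

// Sketch only: creates a hypothetical ORC table over HiveServer2 JDBC and runs
// an aggregation that reads just the 'dept' and 'amount' column streams.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ColumnarFormatExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = con.createStatement()) {

      // Columnar storage: each column is stored (and compressed) separately.
      stmt.execute("CREATE TABLE IF NOT EXISTS sales "
          + "(order_id BIGINT, customer STRING, dept STRING, amount DOUBLE) "
          + "STORED AS ORC");

      // Only the two referenced columns need to be read from the ORC files.
      try (ResultSet rs = stmt.executeQuery(
               "SELECT dept, SUM(amount) FROM sales GROUP BY dept")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
      }
    }
  }
}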

Another application could be storing, reading, and processing data in your own custom input format, where the data might be stored differently from a CSV structure. These might be binary files or any other structure.

You will have to follow the documentation to create input formats. For details you can follow the link: Custom InputFormat with Hive
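As a rough usage sketch (not taken from the linked documentation), once a custom InputFormat class is packaged in a jar it can be attached to a table via STORED AS INPUTFORMAT ... OUTPUTFORMAT .... The connection URL, jar path, table name, and the com.example.hive.VarcharLengthInputFormat class below are hypothetical.

// Sketch only: registers a custom InputFormat on a Hive table through the
// HiveServer2 JDBC driver.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateTableWithCustomInputFormat {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = con.createStatement()) {

      // Make the jar with the custom class visible to the Hive session
      // (the path must exist on the HiveServer2 host).
      stmt.execute("ADD JAR /tmp/custom-inputformat.jar");

      // Read through the custom InputFormat; keep Hive's default text output format.
      stmt.execute(
          "CREATE TABLE customers (name VARCHAR(10)) "
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        + "STORED AS INPUTFORMAT 'com.example.hive.VarcharLengthInputFormat' "
        + "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'");
    }
  }
}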
