RCFile - emitting GZip compressed int columns

For some reason, Hive does not recognize columns I emit as integers, but it does recognize columns emitted as strings.

Is there something about Hive, RCFile, or GZip that prevents an int from being read back correctly?

My Hive DDL looks like:

create external table if not exists db.table (intField int, strField string) stored as rcfile location '/path/to/my/data';

And the relevant portion of my Java looks like:

// Reusable Writables (declared here so the fragment is complete)
IntWritable intWritable = new IntWritable();
Text textWritable = new Text();

BytesRefArrayWritable dataWrite = new BytesRefArrayWritable(2);
byte[] byteArray;

// Column 0: the int field, serialized in IntWritable's binary form
BytesRefWritable bytesRefWritable = new BytesRefWritable();
intWritable.set(myObj.getIntField());
byteArray = WritableUtils.toByteArray(intWritable);  // takes Writables, not a primitive int
bytesRefWritable.set(byteArray, 0, byteArray.length);
dataWrite.set(0, bytesRefWritable);

// Column 1: the string field, as UTF-8 bytes from Text
bytesRefWritable = new BytesRefWritable();
textWritable.set(myObj.getStrField());
bytesRefWritable.set(textWritable.getBytes(), 0, textWritable.getLength());
dataWrite.set(1, bytesRefWritable);

The code runs fine, and through logging I can see that the various Writables contain bytes.

Hive can read the external table as well, but the int field shows up as NULL, which suggests the value could not be parsed:

SELECT * from db.table;

OK
NULL    my string field
Time taken: 0.647 seconds

Any idea what might be going on here?

I'm not sure exactly why this is the case, but I got it working with the following approach:

In the code that writes the byte array representing the integer value, instead of using WritableUtils.toByteArray(), I call Text.set(Integer.toString(intVal)) and then read the bytes with Text.getBytes(), using Text.getLength() as the length.

In other words, I convert the integer to its String representation and use the Text writable to get the byte array, as if the value were a string.

Then, in my Hive DDL, I can still declare the column as int and Hive interprets it correctly.
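To make the difference between the two encodings concrete, here is a minimal sketch using only the JDK. The class IntEncodingDemo and its two helper methods are hypothetical names introduced for illustration. binaryBytes produces the four-byte big-endian layout that DataOutput.writeInt (and therefore IntWritable) writes, while textBytes produces the decimal digits as UTF-8, which is what the Text-based workaround writes. If the table is read by a SerDe that parses column bytes as text (an assumption, since the post does not say which SerDe is in use), only the second form would parse as a number.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the original post.
class IntEncodingDemo {

    // Four-byte big-endian form, the same layout DataOutput.writeInt produces.
    static byte[] binaryBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Decimal digits encoded as UTF-8, the form the Text-based workaround writes.
    static byte[] textBytes(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int value = 42;
        System.out.println("binary: " + Arrays.toString(binaryBytes(value))); // binary: [0, 0, 0, 42]
        System.out.println("text:   " + Arrays.toString(textBytes(value)));   // text:   [52, 50]
    }
}
```

The binary form of 42 is three zero bytes followed by 42 (the character '*'), which is not a valid decimal string, whereas the text form is the characters '4' and '2'.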

I'm not sure what caused the problem initially: a bug in WritableUtils, some incompatibility with compressed integer byte arrays, or a misunderstanding on my part of how this is supposed to work. In any event, the solution described above meets the task's needs.
