
Write data in Apache Parquet format

I have a scheduler that collects our cluster metrics and writes the data to an HDFS file using an older version of the Cloudera API. But recently we updated our JARs, and the original code now fails with an exception.

java.lang.ClassCastException: org.apache.hadoop.io.ArrayWritable cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:116)
at parquet.hadoop.ParquetWriter.write(ParquetWriter.java:324)

I need help using the ParquetHiveRecord class to write the data (which are POJOs) in Parquet format.

Code sample below:

Writable[] values = new Writable[20];
... // populate values with all values
ArrayWritable value = new ArrayWritable(Writable.class, values);
writer.write(value); // <-- Getting exception here

Details of "writer" (of type ParquetWriter):

MessageType schema = MessageTypeParser.parseMessageType(SCHEMA); // SCHEMA is a string with our schema definition
ParquetWriter<ArrayWritable> writer = new ParquetWriter<ArrayWritable>(fileName,
    new DataWritableWriteSupport() {
        @Override
        public WriteContext init(Configuration conf) {
            if (conf.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null)
                conf.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
            return super.init(conf); // init must return a WriteContext
        }
    });

Also, we were using CDH and CM 5.5.1 before; now we are using 5.8.3.

Thanks!

I think you need to use DataWritableWriter rather than ParquetWriter. The class cast exception indicates that the write support class is expecting an instance of ParquetHiveRecord instead of ArrayWritable. DataWritableWriter likely breaks down the individual records in the ArrayWritable into individual messages in the form of ParquetHiveRecord and sends each one to the write support.

Parquet is sort of mind bending at times. :)

Looking at the code of the DataWritableWriteSupport class (https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java), you can see that it already uses DataWritableWriter internally, so you do not need to create an instance of DataWritableWriter yourself. The idea of a write support is that it lets you write different formats to Parquet.
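For reference, the linked class looks roughly like this (a paraphrase from memory, not the exact source; details may differ between Hive versions). The point is that it is parameterized on ParquetHiveRecord and creates its own DataWritableWriter, which is exactly where the ClassCastException in the question comes from:

// Rough paraphrase of org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport
// (see the link above for the real source).
public class DataWritableWriteSupport extends WriteSupport<ParquetHiveRecord> {

    public static final String PARQUET_HIVE_SCHEMA = "parquet.hive.schema";

    private DataWritableWriter writer;
    private MessageType schema;

    @Override
    public WriteContext init(Configuration configuration) {
        schema = MessageTypeParser.parseMessageType(configuration.get(PARQUET_HIVE_SCHEMA));
        return new WriteContext(schema, new HashMap<String, String>());
    }

    @Override
    public void prepareForWrite(RecordConsumer recordConsumer) {
        // The write support owns its DataWritableWriter; callers never build one directly.
        writer = new DataWritableWriter(recordConsumer, schema);
    }

    @Override
    public void write(ParquetHiveRecord record) {
        // This is the line from the stack trace: the record handed in must be a ParquetHiveRecord.
        writer.write(record);
    }
}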

What you do need is to wrap your writables in a ParquetHiveRecord.
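For example, something along these lines (a minimal, untested sketch: it assumes the two-argument ParquetHiveRecord constructor that takes the row data plus a StructObjectInspector, builds a standard struct inspector whose field names and types are placeholders for your actual schema, and reuses the fileName and SCHEMA variables from the question):

// ObjectInspector classes live in org.apache.hadoop.hive.serde2.objectinspector(.primitive);
// ParquetHiveRecord is in org.apache.hadoop.hive.serde2.io.
MessageType schema = MessageTypeParser.parseMessageType(SCHEMA); // SCHEMA as in the question

// Describe the row layout once; the field names and types here are placeholders.
List<String> fieldNames = Arrays.asList("host", "metricValue");
List<ObjectInspector> fieldInspectors = Arrays.<ObjectInspector>asList(
        PrimitiveObjectInspectorFactory.writableStringObjectInspector,
        PrimitiveObjectInspectorFactory.writableLongObjectInspector);
StructObjectInspector inspector =
        ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldInspectors);

// Type the writer on ParquetHiveRecord instead of ArrayWritable.
ParquetWriter<ParquetHiveRecord> writer =
        new ParquetWriter<ParquetHiveRecord>(fileName, new DataWritableWriteSupport() {
            @Override
            public WriteContext init(Configuration conf) {
                if (conf.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null)
                    conf.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
                return super.init(conf);
            }
        });

// One row is the list of field values, wrapped together with its inspector.
List<Object> row = Arrays.<Object>asList(new Text("node-01"), new LongWritable(42L));
writer.write(new ParquetHiveRecord(row, inspector));
writer.close();

The important part is that both the writer's type parameter and the object handed to write() are ParquetHiveRecord; the ObjectInspector is what tells DataWritableWriter how to pull the individual field values out of the wrapped row.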
