Write data in Apache Parquet format

I have a scheduler that collects our cluster metrics and writes the data to an HDFS file using an older version of the Cloudera API. Recently we updated our JARs, and the original code now fails with an exception:

java.lang.ClassCastException: org.apache.hadoop.io.ArrayWritable cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:116)
at parquet.hadoop.ParquetWriter.write(ParquetWriter.java:324)

I need help using the ParquetHiveRecord class to write the data (which are POJOs) in Parquet format.

Code sample below:

Writable[] values = new Writable[20];
... // populate values with all values
ArrayWritable value = new ArrayWritable(Writable.class, values);
writer.write(value); // <-- Getting exception here

Details of "writer" (of type ParquetWriter):

MessageType schema = MessageTypeParser.parseMessageType(SCHEMA); // SCHEMA is a string with our schema definition
ParquetWriter<ArrayWritable> writer = new ParquetWriter<ArrayWritable>(fileName,
    new DataWritableWriteSupport() {
        @Override
        public WriteContext init(Configuration conf) {
            if (conf.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null)
                conf.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
            return super.init(conf); // the base class parses the schema from the configuration and builds the WriteContext
        }
    });

Also, we were using CDH and CM 5.5.1 before; we are now on 5.8.3.

Thanks!

I think you need to use DataWritableWriter rather than ParquetWriter. The class cast exception indicates that the write support class expects an instance of ParquetHiveRecord instead of ArrayWritable. DataWritableWriter likely breaks down the individual records in the ArrayWritable into individual messages in the form of ParquetHiveRecord and sends each one to the write support.

Parquet is sort of mind-bending at times. :)

Looking at the code of the DataWritableWriteSupport class (https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java), you can see it already uses DataWritableWriter internally, so you do not need to create an instance of DataWritableWriter yourself. The idea of a write support is that you can write different formats to Parquet.
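For reference, here is the relevant shape of that class (paraphrased and trimmed from the linked source, not a verbatim copy). It makes the mismatch visible: the write support is parameterized on ParquetHiveRecord, so the cast that ParquetWriter.write() performs on your ArrayWritable fails:

public class DataWritableWriteSupport extends WriteSupport<ParquetHiveRecord> {
    private DataWritableWriter writer;
    private MessageType schema;

    @Override
    public WriteContext init(Configuration configuration) {
        schema = getSchema(configuration); // parsed from PARQUET_HIVE_SCHEMA
        return new WriteContext(schema, new HashMap<String, String>());
    }

    @Override
    public void prepareForWrite(RecordConsumer recordConsumer) {
        writer = new DataWritableWriter(recordConsumer, schema);
    }

    @Override
    public void write(ParquetHiveRecord record) {
        writer.write(record); // walks the record's fields via its ObjectInspector
    }
}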

What you do need is to wrap your writables in a ParquetHiveRecord, as sketched below.
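Concretely, that means typing the writer on ParquetHiveRecord and constructing each record from your values plus an ObjectInspector that matches your schema. Below is a minimal, untested sketch: the field names and inspectors are placeholders for whatever SCHEMA actually defines, fileName and values are the variables from your snippets above, writeSupport stands for your anonymous DataWritableWriteSupport subclass, and each Writable must match its inspector's type (e.g. Text for a string column, LongWritable for a bigint column):

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hive.serde2.io.ParquetHiveRecord;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import parquet.hadoop.ParquetWriter;

// Placeholder columns -- replace with the real fields of your schema,
// in the same order as SCHEMA declares them.
List<String> fieldNames = Arrays.asList("hostname", "timestamp");
List<ObjectInspector> fieldInspectors = Arrays.<ObjectInspector>asList(
    PrimitiveObjectInspectorFactory.writableStringObjectInspector, // Text
    PrimitiveObjectInspectorFactory.writableLongObjectInspector);  // LongWritable
StructObjectInspector inspector =
    ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldInspectors);

// Same write support as before; only the writer's type parameter changes.
ParquetWriter<ParquetHiveRecord> writer =
    new ParquetWriter<ParquetHiveRecord>(fileName, writeSupport);

// Wrap the row (the Writable values, in schema order) together with its
// inspector; DataWritableWriter uses the inspector to walk the fields.
List<Object> row = Arrays.<Object>asList(values);
writer.write(new ParquetHiveRecord(row, inspector));

This keeps your anonymous DataWritableWriteSupport subclass unchanged; the only differences are the writer's type parameter and the ParquetHiveRecord wrapper around each row.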
