
Serialize different types of data into ORC format in Java

I am able to convert CSV data to ORC format. But per a new requirement, my application needs to serialize input data, which can arrive in any format (CSV, Avro, etc.), into a form that an ORC writer understands and write it to a Kafka topic. Later, my application or some other application needs to read this data from the Kafka topic and write it out as ORC files. The input data is held as an attribute of an object, and the same object carries the ORC schema as another attribute.

If you can already create ORC from CSV/Avro/etc. sources, you can create small-ish ORC files, say around 10MB each, and put them in Kafka using your own serialization method, for example Google Protocol Buffers: https://developers.google.com/protocol-buffers/docs/overview

You can define the metadata in your own fields (filename, path/directory, etc.) and send the actual binary ORC file as a simple byte array.
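To make the message layout concrete, here is a minimal sketch of such an envelope. It is not Protocol Buffers: it uses plain JDK `DataOutputStream`/`DataInputStream` as a stand-in so the example runs without extra dependencies. The class name `OrcEnvelope` and its fields are hypothetical; a real setup would more likely define the same fields in a `.proto` file and use the generated classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical envelope: metadata fields plus the raw bytes of one ORC file.
final class OrcEnvelope {
    final String fileName;
    final String directory;
    final byte[] payload;

    OrcEnvelope(String fileName, String directory, byte[] payload) {
        this.fileName = fileName;
        this.directory = directory;
        this.payload = payload;
    }

    // Assumed wire layout: fileName, directory, payload length, payload bytes.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(fileName);
            out.writeUTF(directory);
            out.writeInt(payload.length);
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reverses toBytes(); the payload is returned untouched, whatever it contains.
    static OrcEnvelope fromBytes(byte[] message) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(message))) {
            String fileName = in.readUTF();
            String directory = in.readUTF();
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            return new OrcEnvelope(fileName, directory, payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The byte array returned by `toBytes()` is what would be sent as the Kafka record value, for instance with a byte-array serializer on the producer.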

On the Kafka consumer side, whoever consumes the messages only needs to deserialize them using the protobuf schema and store the received byte arrays as HDFS/S3/etc. files with the proper filename, path, and so on. One of the big advantages here is that Protobuf and Kafka don't care what you're sending in the byte array field. It could be plain text, an ORC file, binary Avro, etc. As long as you name the files properly in the target storage, they should work.
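The consumer-side storage step could look roughly like the sketch below. It writes to the local filesystem through `java.nio.file` as a stand-in for HDFS or S3, and the class name `PayloadStore` and its `store` method are hypothetical. Because the directory and filename arrive inside the message, the sketch also rejects paths that would escape the storage root.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: persists a received payload under root/directory/fileName.
final class PayloadStore {
    static Path store(Path root, String directory, String fileName, byte[] payload) {
        Path base = root.toAbsolutePath().normalize();
        // Treat the directory from the message as relative to the storage root.
        String relativeDir = directory.replaceFirst("^/+", "");
        Path target = base.resolve(relativeDir).resolve(fileName).normalize();
        // Metadata comes from the message, so refuse anything outside the root.
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException("Path escapes storage root: " + target);
        }
        try {
            Files.createDirectories(target.getParent());
            // The bytes are written as-is; the file is only as valid as what was sent.
            return Files.write(target, payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Nothing here inspects the payload, which is the point made above: the same code stores ORC, Avro, or text, and the filename extension is the only hint about the content.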

A few caveats:

  • You would need to adjust the defaults in the Kafka installation to allow messages larger than 1MB, which is the default maximum message size. Make sure to look at this answer to change all the required config values: How can I send large messages with Kafka (over 15MB)?

  • If there is Hive downstream, make sure to define your Hive tables with the matching storage format (ORC, Avro, Parquet, text, etc.) so that they're readable.

  • Smallish files (less than 100MB each) don't work very well with storage such as HDFS or EMR/S3, especially if you have huge amounts of data. So as a final step, you probably want to merge the files once the transfer process is completed.
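For the first caveat, the size-related settings usually involved can be sketched as below. The property names are standard Kafka configuration keys, but the 15MB ceiling, the class name `LargeMessageConfig`, and the grouping into three methods are assumptions for illustration. Which settings actually need changing depends on the Kafka version, so check the linked answer and the documentation for your release.

```java
import java.util.Properties;

// Hypothetical holder for size-related Kafka settings, expressed as plain properties.
final class LargeMessageConfig {
    // Assumed ceiling: 15MB, leaving headroom above ~10MB ORC files plus metadata.
    static final int MAX_MESSAGE_BYTES = 15 * 1024 * 1024;

    // Producer side: largest request the client is willing to send.
    static Properties producer() {
        Properties props = new Properties();
        props.setProperty("max.request.size", String.valueOf(MAX_MESSAGE_BYTES));
        return props;
    }

    // Broker side: these normally go in server.properties, not in client code.
    static Properties broker() {
        Properties props = new Properties();
        props.setProperty("message.max.bytes", String.valueOf(MAX_MESSAGE_BYTES));
        props.setProperty("replica.fetch.max.bytes", String.valueOf(MAX_MESSAGE_BYTES));
        return props;
    }

    // Consumer side: per-partition fetch size large enough for one full message.
    static Properties consumer() {
        Properties props = new Properties();
        props.setProperty("max.partition.fetch.bytes", String.valueOf(MAX_MESSAGE_BYTES));
        return props;
    }
}
```

The producer and consumer properties would be merged into the usual client configuration (bootstrap servers, serializers, and so on) before constructing the clients. A per-topic override, `max.message.bytes`, also exists if only one topic should accept large messages.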
