
Writing custom Java objects to Parquet

I have some custom Java objects (which are internally composed of other custom objects). I wish to write these to HDFS in Parquet format.

Even after a lot of searching, most suggestions seem to involve using an Avro schema and Parquet's internal AvroConverter to store the objects.

Seeing this here and here, it seems like I will have to write a custom WriteSupport to accomplish this.

Is there a better way to do this? Which is more optimal: writing custom objects directly, or using something like Avro as an intermediate schema definition?

You can use Avro reflection to get the schema, like this: ReflectData.AllowNull.get().getSchema(CustomClass.class). I have an example in my Parquet demo code.
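For instance, you can inspect the schema Avro derives by reflection before writing anything (a quick sketch; Team is the example class used in the writer below):

    import org.apache.avro.Schema;
    import org.apache.avro.reflect.ReflectData;

    // Derive an Avro schema from the class via reflection;
    // AllowNull makes every field a nullable union.
    Schema schema = ReflectData.AllowNull.get().getSchema(Team.class);
    System.out.println(schema.toString(true)); // pretty-printed schema JSON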

Essentially the custom Java object writer is this:

    import org.apache.avro.reflect.ReflectData;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    import static org.apache.parquet.hadoop.ParquetFileWriter.Mode.OVERWRITE;
    import static org.apache.parquet.hadoop.metadata.CompressionCodecName.SNAPPY;

    Path dataFile = new Path("/tmp/demo.snappy.parquet");

    // Write the objects as a Snappy-compressed Parquet file,
    // deriving the schema from the Team class via Avro reflection.
    try (ParquetWriter<Team> writer = AvroParquetWriter.<Team>builder(dataFile)
            .withSchema(ReflectData.AllowNull.get().getSchema(Team.class))
            .withDataModel(ReflectData.get())
            .withConf(new Configuration())
            .withCompressionCodec(SNAPPY)
            .withWriteMode(OVERWRITE)
            .build()) {
        for (Team team : teams) {
            writer.write(team);
        }
    }
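Not part of the original snippet, but as a sanity check you could read the file back with AvroParquetReader (a minimal sketch, assuming the same Team class and path):

    import org.apache.avro.reflect.ReflectData;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;

    // Read the records back as Team instances using the reflect data model.
    try (ParquetReader<Team> reader = AvroParquetReader
            .<Team>builder(new Path("/tmp/demo.snappy.parquet"))
            .withDataModel(ReflectData.get())
            .build()) {
        Team team;
        while ((team = reader.read()) != null) {
            System.out.println(team);
        }
    }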

You can replace Team with your own custom Java class. In my demo, the Team class includes a list of Person objects, which is similar to your requirement, and Avro can derive the schema for it without any problem.
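The demo classes aren't shown here, but a nested POJO pair along these lines works with the reflection approach (field names are my assumption, not the actual demo code):

    import java.util.List;

    // Hypothetical POJOs illustrating the nested structure;
    // the real demo classes may differ in fields and naming.
    public class Person {
        private String name;
        private int age;

        public Person() {} // no-arg constructor needed for Avro reflection
    }

    public class Team {
        private String name;
        private List<Person> members; // nested custom objects, as in the question

        public Team() {}
    }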

And if you want to write to HDFS, you may need to use an HDFS-style path instead. But I didn't try that personally.
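Something along these lines should point the writer at HDFS (a sketch, assuming a reachable NameNode at the hypothetical address below):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    // Hypothetical cluster address; replace with your NameNode host and port.
    Path dataFile = new Path("hdfs://namenode:8020/tmp/demo.snappy.parquet");

    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // then pass conf to withConf(...)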

BTW, my code is inspired by this parquet-example code.
