
Writing custom Java objects to Parquet

I have some custom Java objects (which are internally composed of other custom objects). I wish to write these to HDFS in Parquet format.

Even after a lot of searching, most suggestions seem to revolve around using the Avro format and the internal AvroConverter from Parquet to store the objects.

From what I have seen here and here, it seems like I will have to write a custom WriteSupport to accomplish this.

Is there a better way to do this? Which is preferable: writing custom objects directly, or using something like Avro as an intermediate schema definition?

You can use Avro reflection to get the schema. The code for that looks like ReflectData.AllowNull.get().getSchema(CustomClass.class). I have an example Parquet demo code snippet.

Essentially, the custom Java object writer is this:

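    import org.apache.avro.reflect.ReflectData;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import static org.apache.parquet.hadoop.ParquetFileWriter.Mode.OVERWRITE;
    import static org.apache.parquet.hadoop.metadata.CompressionCodecName.SNAPPY;

    Path dataFile = new Path("/tmp/demo.snappy.parquet");

    // Write as Parquet file. The schema is derived from the Team class by
    // Avro reflection; AllowNull makes every field nullable.
    try (ParquetWriter<Team> writer = AvroParquetWriter.<Team>builder(dataFile)
            .withSchema(ReflectData.AllowNull.get().getSchema(Team.class))
            .withDataModel(ReflectData.get())
            .withConf(new Configuration())
            .withCompressionCodec(SNAPPY)
            .withWriteMode(OVERWRITE)
            .build()) {
        // teams is a collection of Team objects built elsewhere.
        for (Team team : teams) {
            writer.write(team);
        }
    }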

You can replace Team with your custom Java class. The Team class includes a list of Person objects, which is similar to your requirement, and Avro can derive the schema without any problem.
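For illustration, a nested object model of that shape might look like the sketch below. The field names, the constructors, and the TeamDemo helper are assumptions made for this example, not the classes from the demo project. As far as I know, Avro reflection works from the fields, and a no-argument constructor is useful if the objects are later read back through reflection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Person class; field names are assumptions.
class Person {
    String name;
    int age;

    Person() {}  // no-arg constructor for reflection-based instantiation

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

// Hypothetical stand-in for the Team class: a custom object that
// contains a list of other custom objects.
class Team {
    String name;
    List<Person> members = new ArrayList<>();

    Team() {}

    Team(String name, List<Person> members) {
        this.name = name;
        this.members = new ArrayList<>(members);
    }
}

class TeamDemo {
    // Builds a small list of teams that could be passed to the writer loop.
    static List<Team> sampleTeams() {
        List<Team> teams = new ArrayList<>();
        teams.add(new Team("red", List.of(new Person("Alice", 30), new Person("Bob", 25))));
        teams.add(new Team("blue", List.of(new Person("Carol", 41))));
        return teams;
    }

    public static void main(String[] args) {
        for (Team team : sampleTeams()) {
            System.out.println(team.name + " has " + team.members.size() + " member(s)");
        }
    }
}
```

A list like the one returned by sampleTeams() plays the role of the teams variable in the writer snippet.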

If you want to write to HDFS, you may need to replace the path with one in HDFS format, but I haven't tried that personally.
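An HDFS-format path is normally a URI with the hdfs scheme. The sketch below only builds such a URI with the JDK. The NameNode host and port are placeholders, the hdfsUri helper is made up for this example, and like the rest of this part it is untested against a real cluster.

```java
import java.net.URI;

class HdfsPathDemo {
    // Builds an hdfs:// URI. Host and port are assumptions; in practice they
    // should match the cluster's fs.defaultFS setting.
    static URI hdfsUri(String host, int port, String file) {
        return URI.create("hdfs://" + host + ":" + port + file);
    }

    public static void main(String[] args) {
        URI uri = hdfsUri("namenode.example.com", 8020, "/tmp/demo.snappy.parquet");
        // prints hdfs://namenode.example.com:8020/tmp/demo.snappy.parquet
        System.out.println(uri);
        // With Hadoop on the classpath, the URI could then be wrapped as
        // new org.apache.hadoop.fs.Path(uri) and passed to the writer builder.
    }
}
```

If the Configuration already points at the cluster through fs.defaultFS, a plain path such as /tmp/demo.snappy.parquet may resolve to HDFS without the explicit scheme.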

BTW, my code is inspired by this parquet-example code.

