Write POJOs to a Parquet file using reflection
Hi, I'm looking for APIs to write Parquet files from POJOs that I already have. I was able to generate an Avro schema using reflection and then create a Parquet schema from it using AvroSchemaConverter. However, I can't find a way to convert the POJOs to GenericRecords (Avro); if I could, I would be able to use AvroParquetWriter to write the POJOs out to Parquet files. Any suggestions?
If you want to go through Avro, you have two options:
1) Let Avro generate your POJOs (see the tutorial here). The generated POJOs implement SpecificRecord, so they can be used with AvroParquetWriter.
2) Write the conversion from your POJO to GenericRecord yourself. You can do this either manually or, more generically, with reflection. However, I ran into difficulties with this approach when I tried to read the data back: based on the supplied schema, Avro found the POJO on the classpath and tried to instantiate a SpecificRecord instead of a GenericRecord. For this reason I went with option 1.
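For what it's worth, the reflection-based conversion in option 2 could look roughly like the sketch below. To keep it runnable without Avro on the classpath, it uses a plain LinkedHashMap as a stand-in for GenericRecord; with Avro you would presumably create a `new GenericData.Record(schema)` and call `put(name, value)` on it instead. The `User` and `PojoToRecordSketch` names are made up for illustration, and only fields declared directly on the class are handled (no inherited or nested fields).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical POJO used only for illustration.
class User {
    String name;
    int age;

    User(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

class PojoToRecordSketch {
    // Copies every declared instance field into a name -> value map.
    // The map is an assumption standing in for an Avro GenericRecord.
    static Map<String, Object> toRecord(Object pojo) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Field field : pojo.getClass().getDeclaredFields()) {
            int mods = field.getModifiers();
            // Skip fields that are not part of the object's data.
            if (Modifier.isStatic(mods) || Modifier.isTransient(mods) || field.isSynthetic()) {
                continue;
            }
            field.setAccessible(true);
            try {
                record.put(field.getName(), field.get(pojo));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return record;
    }

    public static void main(String[] args) {
        System.out.println(toRecord(new User("Ada", 36)));
    }
}
```

The manual variant of option 2 is the same idea with the field names written out by hand, which avoids reflection but has to be kept in sync with the class.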
Parquet now also supports writing POJOs directly. Here is the pull request on the Parquet GitHub page. However, I think this is not part of an official release yet; in other words, I did not find this code in Maven.
DISCLAIMER: I wrote the following code in a hurry. It is not efficient, and future versions of Parquet will surely address this more directly. That being said, it is a lightweight, if inefficient, way to get what you need. The strategy is POJO -> Avro -> Parquet:
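    import java.io.*;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.*;
    import org.apache.avro.io.*;
    import org.apache.avro.reflect.*;

    // These members go inside YOURCLASS. AllowNull makes every field nullable in the schema.
    private static final Schema avroSchema = ReflectData.AllowNull.get().getSchema(YOURCLASS.class);
    private static final ReflectDatumWriter<YOURCLASS> reflectDatumWriter = new ReflectDatumWriter<>(avroSchema);
    private static final GenericDatumReader<Object> genericRecordReader = new GenericDatumReader<>(avroSchema);
    // Serializes this POJO with the reflect writer, then decodes the same bytes as a GenericRecord.
    public GenericRecord toAvroGenericRecord() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        reflectDatumWriter.write(this, EncoderFactory.get().directBinaryEncoder(bytes, null));
        return (GenericRecord) genericRecordReader.read(null, DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null));
    }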
One more thing: the Parquet writers currently seem to be very strict about null fields. Make sure none of your fields are null before attempting to write to Parquet.
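A check along these lines could be run on each object before handing it to the writer. This is only a sketch under my own assumptions: the `Event` POJO and the `NullFieldCheck` helper are hypothetical names, it uses nothing but standard Java reflection, and it only looks at fields declared directly on the class (not inherited ones, and not fields nested inside other objects).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical POJO used only for illustration.
class Event {
    String id;
    String payload;

    Event(String id, String payload) {
        this.id = id;
        this.payload = payload;
    }
}

class NullFieldCheck {
    // Returns the names of the declared instance fields that are currently null.
    static List<String> nullFields(Object pojo) {
        List<String> names = new ArrayList<>();
        for (Field field : pojo.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            field.setAccessible(true);
            try {
                // Primitive fields are boxed by Field.get and can never be null.
                if (field.get(pojo) == null) {
                    names.add(field.getName());
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(nullFields(new Event("e1", null)));
    }
}
```

If the returned list is not empty, you could either fill in defaults or skip the record, depending on what makes sense for your data.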
I wasn't able to find an existing solution, so I implemented it myself. Here is the link to the implementation: https://gist.github.com/alexeygrigorev/eab72e40c6051e0163a6693054906d66
In short, it converts the POJOs to GenericRecord objects; see the gist for the details.