
Write POJOs to a Parquet file using reflection

Hi, I am looking for APIs to write Parquet files from POJOs that I already have. I was able to generate an Avro schema using reflection and then create a Parquet schema from it with AvroSchemaConverter. However, I cannot find a way to convert the POJOs to GenericRecords (Avro); if I could, I would be able to use AvroParquetWriter to write the POJOs out to Parquet files. Any suggestions?

If you want to go through Avro, you have two options:

1) Let Avro generate your POJOs (see the Avro tutorial). The generated POJOs extend SpecificRecordBase, which implements SpecificRecord, so they can be used with AvroParquetWriter.

2) Write the conversion from your POJO to GenericRecord yourself. You can do this manually or, more generically, with reflection. However, I ran into difficulties with this approach when I tried to read the data back. Based on the supplied schema, Avro found the POJO on the classpath and tried to instantiate a SpecificRecord instead of a GenericRecord. For this reason I went with option 1.

Parquet now also supports writing POJOs directly. There is a pull request for this on the Parquet GitHub page. However, I think it is not part of an official release yet; in other words, I did not find this code in Maven.

DISCLAIMER: The following code was written in a hurry. It is not efficient, and future versions of Parquet will surely address this more directly. That being said, it is a lightweight, if inefficient, way to do what you need. The strategy is POJO -> Avro -> Parquet.

  1. POJO -> Avro: Declare a schema via reflection. Declare a writer and a reader based on that schema. At conversion time, write the object to a byte stream and read it back as an Avro GenericRecord.
  2. Avro -> Parquet: Use the AvroParquetWriter included in the parquet-mr project.

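// Schema derived from the POJO by reflection; the AllowNull variant permits null field values.
private static final Schema avroSchema = ReflectData.AllowNull.get().getSchema(YOURCLASS.class);
// Serializes the POJO to Avro binary using reflection.
private static final ReflectDatumWriter<YOURCLASS> reflectDatumWriter = new ReflectDatumWriter<>(avroSchema);
// Deserializes the same bytes back into a generic Avro record.
private static final GenericDatumReader<Object> genericRecordReader = new GenericDatumReader<>(avroSchema);

// Member of YOURCLASS: round-trips this object through Avro binary to obtain a GenericRecord.
public GenericRecord toAvroGenericRecord() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    // The direct binary encoder does not buffer, so the bytes are complete once write() returns.
    reflectDatumWriter.write(this, EncoderFactory.get().directBinaryEncoder(bytes, null));
    return (GenericRecord) genericRecordReader.read(null, DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null));
}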

One more thing: the Parquet writers currently seem to be very strict about null fields. Make sure none of your fields are null before attempting to write to Parquet.
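For the null-field advice, a quick pre-write check can be sketched with plain JDK reflection. Everything here is hypothetical and for illustration only: `NullFieldCheck`, `nullFields` and the `User` POJO are not part of Avro or Parquet, and the check only looks at the fields declared directly on the object's class (no superclass fields, no nested objects).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an Avro or Parquet API): lists null fields of a POJO.
class NullFieldCheck {

    // Example POJO used only for this sketch.
    static class User {
        String name;
        Integer age;

        User(String name, Integer age) {
            this.name = name;
            this.age = age;
        }
    }

    // Returns the names of the instance fields declared on the POJO's class
    // whose current value is null. Static and synthetic fields are skipped.
    static List<String> nullFields(Object pojo) {
        List<String> result = new ArrayList<>();
        for (Field field : pojo.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            field.setAccessible(true);
            try {
                if (field.get(pojo) == null) {
                    result.add(field.getName());
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Only "age" is null here.
        System.out.println(nullFields(new User("Ann", null))); // prints [age]
    }
}
```

Calling such a check before handing each object to the writer makes it easier to see which field is responsible when a write fails.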

I wasn't able to find an existing solution, so I implemented it myself. Here is the link to the implementation: https://gist.github.com/alexeygrigorev/eab72e40c6051e0163a6693054906d66

In short, it does the following:

  • uses reflection to get the Avro schema from the POJO
  • using the schema and reflection, converts POJOs to GenericRecord objects
  • applies reflection recursively if the POJO contains other POJOs or lists of POJOs
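The recursive walk described in the list above can be sketched roughly as follows. This is not the code from the gist: it is a simplified illustration that uses a `LinkedHashMap` as a stand-in for Avro's GenericRecord so that it runs with the JDK alone, and `PojoWalker`, `convert`, `Person` and `Address` are hypothetical names. It handles only strings, numbers, booleans, characters, enums, lists and user-defined POJOs, and it would loop forever on cyclic object graphs.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map stands in for GenericRecord.
class PojoWalker {

    // Example POJOs used only for this sketch.
    static class Address {
        String city;

        Address(String city) {
            this.city = city;
        }
    }

    static class Person {
        String name;
        int age;
        Address home;
        List<Address> previous;

        Person(String name, int age, Address home, List<Address> previous) {
            this.name = name;
            this.age = age;
            this.home = home;
            this.previous = previous;
        }
    }

    // Values that are copied as-is instead of being walked field by field.
    static boolean isLeaf(Object value) {
        return value instanceof String || value instanceof Number
                || value instanceof Boolean || value instanceof Character
                || value instanceof Enum;
    }

    // Converts a value to a record-like structure: leaves stay unchanged,
    // lists are converted element by element, anything else is treated as
    // a POJO and turned into a map of field name to converted field value.
    static Object convert(Object value) {
        if (value == null || isLeaf(value)) {
            return value;
        }
        if (value instanceof List) {
            List<Object> items = new ArrayList<>();
            for (Object item : (List<?>) value) {
                items.add(convert(item));
            }
            return items;
        }
        Map<String, Object> record = new LinkedHashMap<>();
        for (Field field : value.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            field.setAccessible(true);
            try {
                record.put(field.getName(), convert(field.get(value)));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return record;
    }

    public static void main(String[] args) {
        List<Address> previous = new ArrayList<>();
        previous.add(new Address("Rome"));
        System.out.println(convert(new Person("Ann", 30, new Address("Oslo"), previous)));
    }
}
```

In a converter that targets Avro, the map would presumably be replaced by a GenericRecord built against the reflected schema, with each field value stored under its field name in the same way.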
