I've got a problem with simple spark task, which reads Avro file and then save it as Hive parquet table.
I've got 2 types of file, in general they are the same, but the key struct is a little different - field names.
Type 1
root
|-- pk: strucnt (nullable = true)
|-- term_id: string (nullale = true)
Type 2
root
|-- pk: strucnt (nullable = true)
|-- id: string (nullale = true)
I'm reading Avro using spark-avro. And then map this DF to bean like this
Dataset<SomeClass> df = avroDF.as(Encoders.bean(SomeClass.class));
SomeClass is a simple one-field class with getter and setter.
public class SomeClass{
private String term_id;
...
}
So if I'm reading Avro type 1 - it's OK. But if I'm reading Avro type 2 - the error occures. And vice versa if I'm changing the field name to private String id;
Is there any universal solution for my problem? I found @AvroName, but it doesn't allow to set several names. Thanks.
Only one way is to change dataset fieldname to the name which is in schema. Use this example to do it:
val newName = Seq("id", "x1", "x2", "x3")
Dataset<SomeClass> df = avroDF.toDF(newNames: _*).as(Encoders.bean(SomeClass.class));
You can't cast dataframe to a BeanClass which has different field names.
The possible solution is
StructType avroExtendedSchema = avroDF.schema().add("id",DataTypes.StringType);
avroDF.map(row->RowFactory(row.getStruct(0),row.getStruct(0).getString(0)),
RowEncoder.apply(avroExtendedSchema)).toDF();
So the second field of DF will be named "id" and contain the string key. First "pk" struct can be dropped in the future.
avroDF.drop("pk");
PS I found the third type of schema:
root
|-- pk: strucnt (nullable = true)
|-- id: int(nullale = true)
So the final code is like:
DataType keyType = avroDF.select("pk.*").schema().fields[0].dataType();
StructType avroExtendedSchema = avroDF.schema().add("id",keyType);
avroDF.map(row->RowFactory(row.getStruct(0),row.getStruct(0).get(0)),
RowEncoder.apply(avroExtendedSchema)).drop("pk").toDF();
This code suites for any primitive\\String key.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.