How to extract schema from an avro file in Java
How do you extract first the schema and then the data from an avro file in Java? Identical to this question except in Java.
I've seen examples of how to get the schema from an avsc file but not from an avro file. What direction should I be looking in?
Schema schema = new Schema.Parser().parse(
new File("/home/Hadoop/Avro/schema/emp.avsc")
);
If you want to know the schema of an Avro file without having to generate the corresponding classes or care about which class the file belongs to, you can use the GenericDatumReader:
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(new File("file.avro"), datumReader);
Schema schema = dataFileReader.getSchema();
System.out.println(schema);
And then you can read the data inside the file:
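GenericRecord record = null;
while (dataFileReader.hasNext()) {
    // Passing the previous record lets the reader reuse the object instead of allocating a new one
    record = dataFileReader.next(record);
    System.out.println(record);
}
dataFileReader.close(); // release the underlying file handle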
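For background on why no generated classes are needed: an Avro object container file stores its writer schema as JSON in the file header, under the metadata key avro.schema. The following is only a rough sketch of that idea using the plain JDK and no Avro dependency; the class name AvroHeaderSchema and its helper methods are made up for illustration, and in real code DataFileReader.getSchema() as shown above is the better choice.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not part of the Avro library): reads the schema JSON
// straight from the header of an Avro object container file.
final class AvroHeaderSchema {

    /** Reads one Avro long: a variable-length, zigzag-encoded integer. */
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated varint");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;       // high bit clear: last byte
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
        }
        return (raw >>> 1) ^ -(raw & 1);      // undo zigzag encoding
    }

    /** Reads a length-prefixed byte sequence (Avro "bytes" and "string"). */
    static byte[] readBytes(InputStream in) throws IOException {
        long len = readLong(in);
        if (len < 0 || len > Integer.MAX_VALUE) {
            throw new IOException("invalid length: " + len);
        }
        byte[] buf = new byte[(int) len];
        new DataInputStream(in).readFully(buf);
        return buf;
    }

    /** Returns the JSON text stored under "avro.schema" in the file header. */
    static String readSchemaJson(InputStream in) throws IOException {
        byte[] magic = new byte[4];
        new DataInputStream(in).readFully(magic);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1) {
            throw new IOException("not an Avro object container file");
        }
        String schema = null;
        // The metadata is an Avro map: blocks of key/value pairs,
        // terminated by a block whose count is 0.
        long count = readLong(in);
        while (count != 0) {
            if (count < 0) {                  // negative count: a byte size follows
                count = -count;
                readLong(in);                 // block size in bytes, not needed here
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                byte[] value = readBytes(in);
                if (key.equals("avro.schema")) {
                    schema = new String(value, StandardCharsets.UTF_8);
                }
            }
            count = readLong(in);
        }
        if (schema == null) throw new IOException("header has no avro.schema entry");
        return schema;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AvroHeaderSchema.java <file.avro>");
            return;
        }
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            System.out.println(readSchemaJson(in));
        }
    }
}
```

On Java 11 or later this can be run directly as a single source file, passing the path of an .avro file as the only argument. It stops after the header and never decodes the data blocks, so it only answers the "what is the schema" half of the question.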
Thanks to @Helder Pereira for the answer. As a complement, the schema can also be fetched by calling getSchema() on a GenericRecord instance.
Here is a live demo of it; the link above shows how to get the data and schema in Java for the Parquet, ORC and AVRO data formats.
You can use the Databricks library, as shown here: https://github.com/databricks/spark-avro. It loads the avro file into a DataFrame (Dataset<Row>).
Once you have a Dataset<Row>, you can get the schema directly using df.schema().