
Reading an Avro table and schema stored in HDFS using Spark (Java)

I am trying to read an Avro table stored in HDFS, specifying the schema, which is also stored in HDFS. For the moment I have this solution, which seems to work:

    RDD<String> stringRDD = sparkContext.textFile(schemaPath, 1);
    String[] collect = (String[]) stringRDD.collect();
    String schema = collect[0];
    Dataset<Row> df = sqlContext.read()
            .format("com.databricks.spark.avro")
            .option("avroSchema", schema)
            .load(tablePath);

Is this the best way to do it? What if the schema file is big enough to span two partitions, for example? Should I merge them all using reduce()?
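
For instance, I imagine joining every collected line instead of taking only the first one, so a pretty-printed multi-line schema file would still be read in full. A sketch of that idea, assuming a JavaSparkContext (here called javaSparkContext) and the same sqlContext, schemaPath, and tablePath as above:

    // Join all lines of the .avsc file rather than keeping only the first one,
    // so a multi-line schema file is not truncated
    JavaRDD<String> schemaLines = javaSparkContext.textFile(schemaPath, 1);
    String schema = String.join("\n", schemaLines.collect());

    Dataset<Row> df = sqlContext.read()
            .format("com.databricks.spark.avro")
            .option("avroSchema", schema)
            .load(tablePath);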

Cheers

I know it's been a year since this was asked, but I was recently looking to do the same thing, and this question came up at the top of the Google results.

So, I was able to do this using Hadoop's FileSystem class:

    import org.apache.avro.Schema;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;

    String schemaPath = "/path/to/schema/in/hdfs.avsc";

    // Open the .avsc file through Hadoop's FileSystem API and parse it with Avro's Schema.Parser
    FSDataInputStream schemaFile = FileSystem.get(sparkContext.hadoopConfiguration()).open(new Path(schemaPath));
    Schema schema = new Schema.Parser().parse(schemaFile);
    String schemaString = schema.toString();
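
From there, the schema string can be handed to the Avro reader just like in the question. A sketch, assuming the same sqlContext and tablePath as in the question:

    // Pass the parsed schema to spark-avro when loading the table
    Dataset<Row> df = sqlContext.read()
            .format("com.databricks.spark.avro")
            .option("avroSchema", schemaString)
            .load(tablePath);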

Hope this helps!

Another approach, using Spark 2.1.1 and Scala:

    import org.apache.avro.Schema

    // Read the schema file as a single (path, content) pair and parse the content
    val avroSchema = spark.sparkContext.wholeTextFiles(source).take(1)(0)._2
    val schema = new Schema.Parser().parse(avroSchema)
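
Since wholeTextFiles keeps each file as one (filename, content) record, a schema that spans several lines comes back intact, which sidesteps the line-splitting concern raised in the question.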
