I am trying to read an Avro table stored in HDFS while also specifying the schema, which is likewise stored in HDFS. For the moment I have this solution, which seems to work:
RDD<String> stringRDD = sparkContext.textFile(schemaPath, 1);
String[] collect = (String[]) stringRDD.collect();
String schema = collect[0];
Dataset<Row> df = sqlContext.read()
        .format("com.databricks.spark.avro")
        .option("avroSchema", schema)
        .load(tablePath);
Is this the best way to do it? What if the schema file is big enough to span two partitions, for example? Should I merge them all using reduce()?
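For what it's worth, since textFile() splits the file into lines, a multi-line .avsc file would make collect() return more than one element, and taking only collect[0] would truncate the schema. A minimal sketch of joining the lines back together (plain Java, with a local list and the hypothetical class name JoinSchemaLines standing in for the RDD collect() result):

```java
import java.util.Arrays;
import java.util.List;

public class JoinSchemaLines {
    // Rebuild the schema text from the lines that textFile().collect()
    // would return; a local List stands in for the RDD result here.
    static String joinSchema(List<String> lines) {
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "{\"type\": \"record\", \"name\": \"User\",",
                " \"fields\": [{\"name\": \"id\", \"type\": \"long\"}]}");
        System.out.println(joinSchema(lines));
    }
}
```

Joining the collected lines on the driver preserves their order; reduce() would not be a good fit here, since Spark requires the reduce operator to be commutative and gives no ordering guarantee when combining partition results.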
Cheers
I know it's been a year since this was asked, but I was recently looking to do the same thing, and this question came up at the top of my Google search.
So, I was able to do this using Hadoop's FileSystem class:
import org.apache.avro.Schema;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
String schemaPath = "/path/to/schema/in/hdfs.avsc";
FSDataInputStream schemaFile = FileSystem.get(sparkContext.hadoopConfiguration()).open(new Path(schemaPath));
Schema schema = new Schema.Parser().parse(schemaFile);
String schemaString = schema.toString();
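For comparison, the same read-the-whole-file-then-parse pattern can be sketched against the local filesystem with just the JDK; Hadoop's FileSystem/Path classes above are the HDFS analogues of this (the class name and temp-file content are illustrative only):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalSchemaRead {
    // Read an .avsc file in one shot, the local-filesystem analogue of
    // FileSystem.get(conf).open(path) on HDFS.
    static String readSchema(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write an example schema to a temp file, then read it back whole.
        Path tmp = Files.createTempFile("schema", ".avsc");
        Files.write(tmp, "{\"type\": \"string\"}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readSchema(tmp));
    }
}
```

Reading the file as a single stream sidesteps the line-splitting issue from the question entirely, which is why the FileSystem approach is simpler than going through an RDD.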
Hope this helps!
Another approach, using Spark 2.1.1 (Scala). wholeTextFiles returns each file as a single (path, content) pair, so a multi-line schema file is read in one piece:
import org.apache.avro.Schema
val avroSchema = spark.sparkContext.wholeTextFiles(source).take(1)(0)._2
val schema = new Schema.Parser().parse(avroSchema)