I am trying to read an Avro table stored in HDFS while also specifying the schema, which is likewise stored in HDFS. For the moment I have this solution, which seems to work:
RDD<String> stringRDD = sparkContext.textFile(schemaPath, 1);
String[] collect = (String[]) stringRDD.collect();
String schema = collect[0];
Dataset<Row> df = sqlContext.read()
        .format("com.databricks.spark.avro")
        .option("avroSchema", schema)
        .load(tablePath);
Is this the best way to do it? What if the schema file is big enough to span two partitions, for example? Should I merge them all using reduce()?
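For what it's worth, since textFile() splits the file into lines, a multi-line .avsc file would make collect() return more than one element, and taking only collect[0] would truncate the schema. A minimal sketch of joining the lines back together (plain Java, with a local list and the hypothetical class name JoinSchemaLines standing in for the RDD collect() result):

```java
import java.util.Arrays;
import java.util.List;

public class JoinSchemaLines {
    // Rebuild the schema text from the lines that textFile().collect()
    // would return; a local List stands in for the RDD result here.
    static String joinSchema(List<String> lines) {
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "{\"type\": \"record\", \"name\": \"User\",",
                " \"fields\": [{\"name\": \"id\", \"type\": \"long\"}]}");
        System.out.println(joinSchema(lines));
    }
}
```

Joining the collected lines on the driver preserves their order; reduce() would not be a good fit here, since Spark requires the reduce operator to be commutative and gives no ordering guarantee when combining partition results.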
Cheers
I know it's been a year since this was asked, but I was recently looking to do the same thing, and this question came up at the top of my Google search.
So, I was able to do this using Hadoop's FileSystem class:
import org.apache.avro.Schema;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
String schemaPath = "/path/to/schema/in/hdfs.avsc";
FSDataInputStream schemaFile = FileSystem.get(sparkContext.hadoopConfiguration()).open(new Path(schemaPath));
Schema schema = new Schema.Parser().parse(schemaFile);
String schemaString = schema.toString();
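For comparison, the same read-the-whole-file-then-parse pattern can be sketched against the local filesystem with just the JDK; Hadoop's FileSystem/Path classes above are the HDFS analogues of this (the class name and temp-file content are illustrative only):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalSchemaRead {
    // Read an .avsc file in one shot, the local-filesystem analogue of
    // FileSystem.get(conf).open(path) on HDFS.
    static String readSchema(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write an example schema to a temp file, then read it back whole.
        Path tmp = Files.createTempFile("schema", ".avsc");
        Files.write(tmp, "{\"type\": \"string\"}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readSchema(tmp));
    }
}
```

Reading the file as a single stream sidesteps the line-splitting issue from the question entirely, which is why the FileSystem approach is simpler than going through an RDD.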
Hope this helps!
Another approach, using Spark 2.1.1 (Scala). wholeTextFiles returns each file as a single (path, content) pair, so a multi-line schema file is read in one piece:
import org.apache.avro.Schema
val avroSchema = spark.sparkContext.wholeTextFiles(source).take(1)(0)._2
val schema = new Schema.Parser().parse(avroSchema)