Reading an Avro table and its schema stored in HDFS using Spark (Java)

Question

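I am trying to read an Avro table stored in HDFS while specifying a schema that is also stored in HDFS. At the moment I have this solution, which seems to work: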

    RDD<String> stringRDD = sparkContext.textFile(schemaPath, 1);
    String[] collect = (String[]) stringRDD.collect();
    String schema = collect[0];
    Dataset<Row> df = sqlContext.read().format("com.databricks.spark.avro").option("avroSchema", schema)
            .load(tablePath);

Is this the best way to do it? What if the schema is large enough that it ends up in, say, 2 partitions? Should I merge them with reduce()?

Cheers
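
On the partition question above: as far as I know, `textFile` yields one element per line and `collect()` returns the elements of all partitions in order, so `collect[0]` is the first line of the file however many partitions there are. That is enough only when the whole schema sits on a single line. Below is a minimal plain-Java sketch of joining the collected lines instead. The class name `SchemaLineJoiner`, the helper `joinSchemaLines`, and the sample `User` schema are made up for illustration, and the list stands in for what `collect()` would return:

```java
import java.util.Arrays;
import java.util.List;

public class SchemaLineJoiner {

    // Rebuilds the schema text from the collected lines.
    // Assumes the lines arrive in file order.
    static String joinSchemaLines(List<String> lines) {
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        // Stand-in for the result of collect() on a pretty-printed .avsc file.
        List<String> collected = Arrays.asList(
                "{",
                "  \"type\": \"record\",",
                "  \"name\": \"User\",",
                "  \"fields\": [{\"name\": \"id\", \"type\": \"long\"}]",
                "}");

        // Taking only the first element keeps just the opening brace.
        System.out.println("first line only: " + collected.get(0));
        System.out.println("full schema:\n" + joinSchemaLines(collected));
    }
}
```

With a real RDD the same join would be applied to the collected array or list, so no `reduce()` should be needed.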

Answer 1

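I know it has been a year since this question was asked, but I recently wanted to do the same thing and this question came up near the top on Google.

So, I was able to do this using Hadoop's FileSystem class: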

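import org.apache.avro.Schema;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;

String schemaPath = "/path/to/schema/in/hdfs.avsc";
// From Java, hadoopConfiguration() has to be called as a method
FileSystem fs = FileSystem.get(sparkContext.hadoopConfiguration());
Schema schema;
// try-with-resources closes the stream once the schema has been parsed
try (FSDataInputStream schemaFile = fs.open(new Path(schemaPath))) {
    schema = new Schema.Parser().parse(schemaFile);
}
// String form of the schema, e.g. for the "avroSchema" option
String schemaString = schema.toString();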

Hope this helps!
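
Since `FSDataInputStream` is an ordinary `java.io.InputStream`, the raw schema text can also be read without the Avro parser, if all that is needed is the string for the `avroSchema` option. Parsing as above has the advantage of validating the schema first. Below is a minimal sketch using only the JDK. `SchemaTextReader` and `readSchemaText` are hypothetical names, and the in-memory stream stands in for the one returned by `FileSystem.open`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class SchemaTextReader {

    // Reads the whole stream as UTF-8 text and closes it afterwards.
    static String readSchemaText(InputStream in) {
        try (InputStream stream = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the stream returned by FileSystem.open(...).
        String sample = "{\n  \"type\": \"string\"\n}";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(readSchemaText(in));
    }
}
```

Unlike reading line by line, this keeps a multi-line schema file intact.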

Answer 2

Another approach, using Spark 2.1.1:

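import org.apache.avro.Schema

// wholeTextFiles yields (path, content) pairs; take the content of the first file.
// `source` is the path of the schema file.
val avroSchema = spark.sparkContext.wholeTextFiles(source).take(1)(0)._2
val schema = new Schema.Parser().parse(avroSchema)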

Answer 1
0 2017-12-05 12:13:16

Answer 2
0 2019-02-27 13:15:41
