How to read an Avro file in Spark using newAPIHadoopFile?
I am trying to read an Avro file in a Spark job.
My Spark version is 1.6.0 (spark-core_2.10-1.6.0-cdh5.7.1).
Here is my Java code:
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ReadAvro"));
JavaPairRDD <NullWritable, Text> lines = sc.newAPIHadoopFile(args[0],AvroKeyValueInputFormat.class,AvroKey.class,AvroValue.class,new Configuration());
But I get a compile-time error:
The method newAPIHadoopFile(String, Class, Class, Class, Configuration) in the type JavaSparkContext is not applicable for the arguments (String, Class, Class, Class, Configuration)
So what is the correct way to use JavaSparkContext.newAPIHadoopFile() in Java?
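import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class Utils {
    public static <T> JavaPairRDD<String, T> loadAvroFile(JavaSparkContext sc, String avroPath) {
        JavaPairRDD<AvroKey, NullWritable> records = sc.newAPIHadoopFile(avroPath, AvroKeyInputFormat.class, AvroKey.class, NullWritable.class, sc.hadoopConfiguration());
        return records.keys()
            .map(x -> (GenericRecord) x.datum()) // each key wraps one Avro record
            .mapToPair(pair -> new Tuple2<>(pair.get("key").toString(), (T) pair.get("value"))); // toString(): the field may be Utf8, not String
    }
}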
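As for why this compiles while the call in the question does not: as far as I can tell, newAPIHadoopFile is generic in the key type K, the value type V and the input format F, and it returns a JavaPairRDD<K, V> whose types come from the key and value classes you pass in. Passing AvroKey.class and AvroValue.class while assigning the result to JavaPairRDD<NullWritable, Text> therefore cannot type-check. Here is a minimal sketch of that rule with no Spark or Avro on the classpath. Every type in it (InputFormat, AvroKey, NullWritable, Text, AvroKeyInputFormat, load) is a hypothetical stand-in that only mimics the shape of the real signature:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class GenericsSketch {
    // Hypothetical stand-ins for the Hadoop/Avro types; not the real classes.
    interface InputFormat<K, V> {}
    static class AvroKey {}
    static class NullWritable {}
    static class Text {}
    static class AvroKeyInputFormat implements InputFormat<AvroKey, NullWritable> {}

    // Mimics the shape of newAPIHadoopFile: the result type is fixed by kClass and vClass,
    // and fClass must be an input format over exactly those key and value types.
    static <K, V, F extends InputFormat<K, V>> Map.Entry<Class<K>, Class<V>> load(
            String path, Class<F> fClass, Class<K> kClass, Class<V> vClass) {
        return new SimpleEntry<>(kClass, vClass);
    }

    // Compiles: the declared result types match the classes passed in.
    static String describe() {
        Map.Entry<Class<AvroKey>, Class<NullWritable>> ok =
                load("in.avro", AvroKeyInputFormat.class, AvroKey.class, NullWritable.class);
        // Rejected by the compiler, like the call in the question:
        // Map.Entry<Class<NullWritable>, Class<Text>> bad =
        //         load("in.avro", AvroKeyInputFormat.class, AvroKey.class, NullWritable.class);
        return ok.getKey().getSimpleName() + " / " + ok.getValue().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

So the practical rule is: declare the RDD with the same key and value types as the classes you hand to newAPIHadoopFile, then map to the types you actually want afterwards.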
Use the utility like this:
JavaPairRDD<String, YourAvroClassName> records = Utils.<YourAvroClassName>loadAvroFile(sc, inputDir);
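One caveat about reading the "key" field: with Avro's generic API, string fields typically come back as org.apache.avro.util.Utf8 rather than java.lang.String, and in that case a plain (String) cast throws ClassCastException at runtime, whereas toString() works for both. A small sketch of the difference, using StringBuilder as a stand-in for Utf8 so that it runs without Avro. FieldToString, asString and castFails are made-up names for illustration:

```java
class FieldToString {
    // Converts a field value to String without assuming its runtime class.
    static String asString(Object field) {
        return field == null ? null : field.toString();
    }

    // Returns true if a plain (String) cast of the value would throw.
    static boolean castFails(Object field) {
        try {
            String ignored = (String) field;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // StringBuilder stands in for a non-String CharSequence such as Avro's Utf8.
        Object field = new StringBuilder("user-42");
        System.out.println(castFails(field));
        System.out.println(asString(field));
    }
}
```

If your schema is set up so that strings are already java.lang.String, the cast is fine, but toString() costs nothing and covers both cases.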
You may also need to use KryoSerializer and register your own custom KryoRegistrator:
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryo.registrator", "com.test.avro.MyKryoRegistrator");
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.serializers.CollectionSerializer;
import java.util.ArrayList;
import java.util.Collection;
import org.apache.avro.generic.GenericData;
import org.apache.spark.serializer.KryoRegistrator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyKryoRegistrator implements KryoRegistrator {

    // Collection serializer that always instantiates the given concrete collection type.
    public static class SpecificInstanceCollectionSerializer<T extends Collection> extends CollectionSerializer {
        Class<T> type;

        public SpecificInstanceCollectionSerializer(Class<T> type) {
            this.type = type;
        }

        @Override
        protected Collection create(Kryo kryo, Input input, Class<Collection> type) {
            return kryo.newInstance(this.type);
        }

        @Override
        protected Collection createCopy(Kryo kryo, Collection original) {
            return kryo.newInstance(this.type);
        }
    }

    Logger logger = LoggerFactory.getLogger(this.getClass());

    @Override
    public void registerClasses(Kryo kryo) {
        // Avro POJOs contain java.util.List fields whose runtime type is GenericData.Array.
        // Kryo cannot serialize those properly, so we deserialize them as ArrayList instead.
        kryo.register(GenericData.Array.class, new SpecificInstanceCollectionSerializer<>(ArrayList.class));
        kryo.register(YourAvroClassName.class);
    }
}
Hope this helps...