Reading/writing Avro files in Spark Core using Java

I need to access Avro file data in a program written in Java on Spark Core. I can use the MapReduce InputFormat class, but it gives me a tuple containing each line of the file as a key. It's very hard to parse since I am not using Scala.

JavaPairRDD<AvroKey<GenericRecord>, NullWritable> avroRDD = sc.newAPIHadoopFile("dataset/testfile.avro", AvroKeyInputFormat.class, AvroKey.class, NullWritable.class, new Configuration());

Is there any utility class or jar available which I can use to map Avro data directly into Java classes? For example, the codehaus.jackson package has a provision for mapping JSON to a Java class.

Otherwise, is there any other method to easily parse the fields present in an Avro file into Java classes or RDDs?

Consider that your Avro file contains serialized pairs, with the key being a String and the value being an Avro class. Then you could have a generic static function in some Utils class that looks like this:

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class Utils {

  // Loads an Avro file of records that have a String "key" field and a "value" field of type T.
  public static <T> JavaPairRDD<String, T> loadAvroFile(JavaSparkContext sc, String avroPath) {
    JavaPairRDD<AvroKey, NullWritable> records = sc.newAPIHadoopFile(avroPath, AvroKeyInputFormat.class, AvroKey.class, NullWritable.class, sc.hadoopConfiguration());
    return records.keys()
        .map(x -> (GenericRecord) x.datum())
        .mapToPair(record -> new Tuple2<>((String) record.get("key"), (T) record.get("value")));
  }
}
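
For reference, the pair layout assumed above could be described by a schema along these lines — a minimal sketch using Avro's SchemaBuilder, where the record name "Pair" is illustrative, the field names "key" and "value" match what loadAvroFile reads, and getClassSchema() is the accessor Avro generates on specific record classes:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

Schema pairSchema = SchemaBuilder.record("Pair")  // record name is illustrative
    .fields()
    .requiredString("key")
    .name("value").type(YourAvroClassName.getClassSchema()).noDefault()
    .endRecord();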

And then you could use the method this way:

JavaPairRDD<String, YourAvroClassName> records = Utils.<YourAvroClassName>loadAvroFile(sc, inputDir);
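
For completeness, sc above is a plain JavaSparkContext; a minimal sketch of constructing it (the app name is illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("avro-example"); // illustrative name
JavaSparkContext sc = new JavaSparkContext(sparkConf);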

You might also need to use KryoSerializer and register your custom KryoRegistrator:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryo.registrator", "com.test.avro.MyKryoRegistrator");

And the registrator class would look this way:

import java.util.ArrayList;
import java.util.Collection;

import org.apache.avro.generic.GenericData;
import org.apache.spark.serializer.KryoRegistrator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.serializers.CollectionSerializer;

public class MyKryoRegistrator implements KryoRegistrator {

  // Serializer that always instantiates the concrete collection type given at construction time,
  // regardless of the runtime collection type Kryo encounters.
  public static class SpecificInstanceCollectionSerializer<T extends Collection> extends CollectionSerializer {
    Class<T> type;
    public SpecificInstanceCollectionSerializer(Class<T> type) {
      this.type = type;
    }

    @Override
    protected Collection create(Kryo kryo, Input input, Class<Collection> type) {
      return kryo.newInstance(this.type);
    }

    @Override
    protected Collection createCopy(Kryo kryo, Collection original) {
      return kryo.newInstance(this.type);
    }
  }

  Logger logger = LoggerFactory.getLogger(this.getClass());

  @Override
  public void registerClasses(Kryo kryo) {
    // Avro POJOs contain java.util.List fields whose runtime type is GenericData.Array.
    // Kryo is not able to serialize those properly, so we substitute this serializer for them.
    kryo.register(GenericData.Array.class, new SpecificInstanceCollectionSerializer<>(ArrayList.class));
    kryo.register(YourAvroClassName.class);
  }
}
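
The question also asks about writing. A minimal sketch of the reverse direction with the Hadoop AvroKeyOutputFormat, assuming records is the JavaPairRDD<String, YourAvroClassName> loaded earlier and pairSchema describes the key/value record; outputDir is illustrative, and the schema is passed into the closure as a String and re-parsed per record only to keep the sketch serialization-safe and short:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import scala.Tuple2;

Job job = Job.getInstance(sc.hadoopConfiguration());
AvroJob.setOutputKeySchema(job, pairSchema);  // schema of the records being written
String schemaJson = pairSchema.toString();    // Schema itself may not be serializable, so ship it as a String
records
    .mapToPair(t -> {
      // rebuild the pair record; parsing per record is inefficient but keeps the sketch simple
      GenericRecord rec = new GenericData.Record(new Schema.Parser().parse(schemaJson));
      rec.put("key", t._1);
      rec.put("value", t._2);
      return new Tuple2<>(new AvroKey<GenericRecord>(rec), NullWritable.get());
    })
    .saveAsNewAPIHadoopFile(outputDir, AvroKey.class, NullWritable.class,
        AvroKeyOutputFormat.class, job.getConfiguration());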
