
How do I use Java to read AVRO data in Spark 1.3.1?

I am trying to develop a Java Spark Application that reads AVRO records ( https://avro.apache.org/ ) from HDFS put there by a technology called Gobblin ( https://github.com/linkedin/gobblin/wiki ).

A sample HDFS AVRO data file:

/gobblin/work/job-output/KAFKA/kafka-gobblin-hdfs-test/20150910213846_append/part.task_kafka-gobblin-hdfs-test_1441921123461_0.avro

Unfortunately, I am finding that there are limited examples written in Java.

The best thing I have found is written in Scala (using Hadoop version 1 libraries).

Any help would be appreciated.

Currently I am thinking of using the below code, though I am unsure how to extract a HashMap of values from my AVRO data:

JavaPairRDD<AvroKey, NullWritable> avroRDD = sc.newAPIHadoopFile(
    path,
    AvroKeyInputFormat.class,
    AvroKey.class,
    NullWritable.class,
    new Configuration() );

// JavaPairRDD<AvroKey, AvroValue> avroRDD = sc.newAPIHadoopFile(
//    path,
//    AvroKeyValueInputFormat.class,
//    AvroKey.class,
//    AvroValue.class,
//    new Configuration() );
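
If the records are read as GenericRecord (i.e. without Avro-generated classes), one way to get a HashMap of values out of each record is to walk its schema and copy the fields. The toMap helper below is purely illustrative (it is not part of the Avro or Gobblin APIs) and stringifies every value for simplicity:

import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

// Illustrative helper: copy every field of a GenericRecord into a plain HashMap.
public static Map<String, String> toMap( GenericRecord record ) {
    Map<String, String> values = new HashMap<String, String>();
    for ( Schema.Field field : record.getSchema().getFields() ) {
        Object value = record.get( field.name() );
        values.put( field.name(), value == null ? null : value.toString() );
    }
    return values;
}

// Example usage (Java 8 lambda), assuming the avroRDD above:
// avroRDD.map( tuple -> toMap( (GenericRecord) tuple._1().datum() ) )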

My current Maven dependencies:

<dependencies>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.3.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>1.7.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro-mapred</artifactId>
        <version>1.7.6</version>
        <classifier>hadoop2</classifier>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-annotations</artifactId>
        <version>2.4.3</version>
    </dependency>

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <scope>test</scope>
    </dependency>

</dependencies>

I wrote a small prototype that was able to read my sample Gobblin Avro records as input and, using Spark, output the relevant results ( spark-hdfs-avro-test ). It is worth mentioning that there were a couple of issues I needed to address. Any comments or feedback would be greatly appreciated.

Issue 1: There are issues with the current Avro release (1.7.7) and Java Serialization:

To quote:

Spark relies on Java's Serializable interface to serialize objects. Avro objects don't implement Serializable. So, to work with Avro objects in Spark, you need to subclass your Avro generated classes and implement Serializable, e.g. https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SerializableAminoAcid.java

To address this I wrote my own Serializable wrapper classes:
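
The actual wrapper classes are in the linked project; as a minimal sketch of the idea, following the SerializableAminoAcid example quoted above and assuming Event is the Avro-generated class used later in this post, the wrapper extends the generated class and delegates Java serialization to Avro's binary encoding:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.avro.Schema;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

// Illustrative wrapper around the Avro-generated Event class; the same idea
// applies to any generated type.
public class SerializableEvent extends Event implements Serializable {

    public SerializableEvent() {
    }

    public SerializableEvent( Event event ) {
        // copy every schema field from the wrapped record
        for ( Schema.Field field : SCHEMA$.getFields() ) {
            put( field.pos(), event.get( field.pos() ) );
        }
    }

    private void writeObject( ObjectOutputStream out ) throws IOException {
        new SpecificDatumWriter<Event>( Event.class )
            .write( this, EncoderFactory.get().directBinaryEncoder( out, null ) );
    }

    private void readObject( ObjectInputStream in ) throws IOException {
        new SpecificDatumReader<Event>( Event.class )
            .read( this, DecoderFactory.get().directBinaryDecoder( in, null ) );
    }
}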

Issue 2: My Avro Messages don't contain a "Key" value.

Unfortunately, I was unable to use any out-of-the-box input formats and had to write my own: AvroValueInputFormat

public class AvroValueInputFormat<T> extends FileInputFormat<NullWritable, AvroValue<T>> {
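
Only the class signature is shown above (the full class is in the linked prototype). As a rough sketch of how such an input format could be completed, the nested AvroValueRecordReader below is a hypothetical companion class, not part of avro-mapred, modelled on the stock org.apache.avro.mapreduce.AvroKeyRecordReader:

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroRecordReaderBase;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: an input format that exposes each Avro record as the *value*, with a
// NullWritable key, mirroring AvroKeyInputFormat in reverse.
public class AvroValueInputFormat<T> extends FileInputFormat<NullWritable, AvroValue<T>> {

    @Override
    public RecordReader<NullWritable, AvroValue<T>> createRecordReader(
            InputSplit split, TaskAttemptContext context ) throws IOException, InterruptedException {
        // Reads the value schema set via "avro.schema.input.value" (see Issue 3 below).
        Schema readerSchema = AvroJob.getInputValueSchema( context.getConfiguration() );
        return new AvroValueRecordReader<T>( readerSchema );
    }

    // Hypothetical companion reader, modelled on AvroKeyRecordReader: the record
    // goes into the value and the key is always NullWritable.
    public static class AvroValueRecordReader<T> extends AvroRecordReaderBase<NullWritable, AvroValue<T>, T> {

        private final AvroValue<T> mCurrentValue = new AvroValue<T>( null );

        public AvroValueRecordReader( Schema readerSchema ) {
            super( readerSchema );
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            boolean hasNext = super.nextKeyValue();
            mCurrentValue.datum( getCurrentRecord() );
            return hasNext;
        }

        @Override
        public NullWritable getCurrentKey() {
            return NullWritable.get();
        }

        @Override
        public AvroValue<T> getCurrentValue() {
            return mCurrentValue;
        }
    }
}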

I was unable to use the following:

// org.apache.avro.mapreduce.AvroKeyInputFormat
public class AvroKeyInputFormat<T> extends FileInputFormat<AvroKey<T>, NullWritable> {

// org.apache.avro.mapreduce.AvroKeyValueInputFormat
public class AvroKeyValueInputFormat<K, V> extends FileInputFormat<AvroKey<K>, AvroValue<V>> {

Issue 3: I was unable to use the AvroJob class setters to set schema values, so I had to set them manually:

    hadoopConf.set( "avro.schema.input.key", Schema.create( org.apache.avro.Schema.Type.NULL ).toString() ); //$NON-NLS-1$
    hadoopConf.set( "avro.schema.input.value", Event.SCHEMA$.toString() ); //$NON-NLS-1$
    hadoopConf.set( "avro.schema.output.key", Schema.create( org.apache.avro.Schema.Type.NULL ).toString() ); //$NON-NLS-1$
    hadoopConf.set( "avro.schema.output.value", SeverityEventCount.SCHEMA$.toString() ); //$NON-NLS-1$
