
How can I read this avro file using spark & scala?

I've seen various spark and avro questions (including How can I load Avros in Spark using the schema on-board the Avro file(s)?), but none of the solutions work for me with the following avro file:

http://www.4shared.com/file/SxnYcdgJce/sample.html

When I try to read the avro file using the solution above, I get errors about it not being serializable (spark java.io.NotSerializableException: org.apache.avro.mapred.AvroWrapper).

How can I set up Spark 1.1.0 (using Scala) to read this sample avro file?

-- update --

I've moved this to the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html

I had the same problem when trying to read an Avro file. The reason is that AvroWrapper does not implement the java.io.Serializable interface.

The solution was to use org.apache.spark.serializer.KryoSerializer.

import org.apache.spark.SparkConf

val cfg = new SparkConf().setAppName("MySparkJob")
cfg.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
cfg.set("spark.kryo.registrator", "com.stackoverflow.Registrator")

However, that wasn't enough, as my class contained in the Avro file didn't implement Serializable either.

Therefore I added my own registrator, extending KryoRegistrator, and included the chill-avro library.

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import com.twitter.chill.avro.AvroSerializer

class Registrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register each generated Avro record class with a chill-avro serializer
    kryo.register(classOf[MyClassInAvroFile], AvroSerializer.SpecificRecordBinarySerializer[MyClassInAvroFile])
    kryo.register(classOf[AnotherClassInAvroFile], AvroSerializer.SpecificRecordBinarySerializer[AnotherClassInAvroFile])
  }
}

Then I was able to read a file like this:

import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

// hadoopFile yields (AvroWrapper, NullWritable) pairs; datum() unwraps the record
ctx.hadoopFile("/path/to/the/avro/file.avro",
  classOf[AvroInputFormat[MyClassInAvroFile]],
  classOf[AvroWrapper[MyClassInAvroFile]],
  classOf[NullWritable]
).map(_._1.datum())

Setting the serializer to Kryo should do the trick.

One way is to uncomment this line in /etc/spark/conf/spark-defaults.conf:

spark.serializer org.apache.spark.serializer.KryoSerializer
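If you'd rather not edit the cluster-wide defaults file, the same properties can be passed per job on the command line. A sketch, assuming the registrator class from the answer above; the job class and jar name are placeholders:

```shell
# Pass the serializer settings for this job only, instead of editing spark-defaults.conf
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=com.stackoverflow.Registrator \
  --class com.stackoverflow.MySparkJob \
  my-spark-job.jar
```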

My solution was to use Spark 1.2 with Spark SQL, as linked in my question:

val person = sqlContext.avroFile("/tmp/person.avro")
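For context, avroFile is not built into Spark SQL itself; it comes from the Databricks spark-avro package. A minimal sketch of the setup, assuming the com.databricks:spark-avro artifact is on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._  // adds avroFile to SQLContext via an implicit

val sc = new SparkContext(new SparkConf().setAppName("AvroSqlJob"))
val sqlContext = new SQLContext(sc)

// Reads the Avro file directly, with no Kryo registration needed
val person = sqlContext.avroFile("/tmp/person.avro")
person.printSchema()
```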
