
How can I read this avro file using spark & scala?

I've seen various spark and avro questions (including How can I load Avros in Spark using the schema on-board the Avro file(s)?), but none of the solutions work for me with the following avro file:

http://www.4shared.com/file/SxnYcdgJce/sample.html

When I try to read the avro file using the solution above, I get errors about it not being serializable (spark java.io.NotSerializableException: org.apache.avro.mapred.AvroWrapper).

How can I set up Spark 1.1.0 (using Scala) to read this sample avro file?

-- update --

I've moved this to the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html

I had the same problem when trying to read an Avro file. The reason is that AvroWrapper does not implement the java.io.Serializable interface.

The solution was to use org.apache.spark.serializer.KryoSerializer.

import org.apache.spark.SparkConf

val cfg = new SparkConf().setAppName("MySparkJob")
cfg.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
cfg.set("spark.kryo.registrator", "com.stackoverflow.Registrator")

However, that wasn't enough, as my class contained in the Avro file didn't implement Serializable either.

Therefore I added my own registrator, extending KryoRegistrator, and included the chill-avro library.

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import com.twitter.chill.avro.AvroSerializer

class Registrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register each generated Avro record class with a chill-avro serializer
    kryo.register(classOf[MyClassInAvroFile], AvroSerializer.SpecificRecordBinarySerializer[MyClassInAvroFile])
    kryo.register(classOf[AnotherClassInAvroFile], AvroSerializer.SpecificRecordBinarySerializer[AnotherClassInAvroFile])
  }
}

Then I was able to read a file like this:

import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

// hadoopFile yields (AvroWrapper, NullWritable) pairs; datum() unwraps the record
ctx.hadoopFile("/path/to/the/avro/file.avro",
  classOf[AvroInputFormat[MyClassInAvroFile]],
  classOf[AvroWrapper[MyClassInAvroFile]],
  classOf[NullWritable]
).map(_._1.datum())

Setting the serializer to Kryo should do the trick.

One way is to uncomment this line in /etc/spark/conf/spark-defaults.conf:

spark.serializer org.apache.spark.serializer.KryoSerializer
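If you'd rather not edit the cluster-wide defaults file, the same properties can be passed per job on the command line. A sketch, assuming the registrator class from the answer above; the job class and jar name are placeholders:

```shell
# Pass the serializer settings for this job only, instead of editing spark-defaults.conf
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=com.stackoverflow.Registrator \
  --class com.stackoverflow.MySparkJob \
  my-spark-job.jar
```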

My solution was to use Spark 1.2 with Spark SQL, as linked in my question:

val person = sqlContext.avroFile("/tmp/person.avro")
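For context, avroFile is not built into Spark SQL itself; it comes from the Databricks spark-avro package. A minimal sketch of the setup, assuming the com.databricks:spark-avro artifact is on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._  // adds avroFile to SQLContext via an implicit

val sc = new SparkContext(new SparkConf().setAppName("AvroSqlJob"))
val sqlContext = new SQLContext(sc)

// Reads the Avro file directly, with no Kryo registration needed
val person = sqlContext.avroFile("/tmp/person.avro")
person.printSchema()
```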
