How can I read this avro file using spark & scala?
I've seen various spark and avro questions (including How can I load Avros in Spark using the schema on-board the Avro file(s)?), but none of the solutions work for me with the following avro file:
http://www.4shared.com/file/SxnYcdgJce/sample.html
When I try to read the avro file using the solution above, I get errors about it not being serializable (spark java.io.NotSerializableException: org.apache.avro.mapred.AvroWrapper).
How can I set up spark 1.1.0 (using scala) to read this sample avro file?
-- update --
I've moved this to the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html
I had the same problem when trying to read an Avro file. The reason is that AvroWrapper does not implement the java.io.Serializable interface.

The solution was to use org.apache.spark.serializer.KryoSerializer.
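The root cause can be reproduced with plain Java serialization, independent of Spark: attempting to serialize an instance of any class that does not implement java.io.Serializable throws exactly this exception. A minimal sketch (RecordWrapper here is a hypothetical stand-in for AvroWrapper, not a real Avro class):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Like org.apache.avro.mapred.AvroWrapper, this class does NOT implement java.io.Serializable
class RecordWrapper(val datum: String)

// Returns true if Java serialization accepts the object, false if it throws NotSerializableException
def canJavaSerialize(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }

println(canJavaSerialize("a plain string"))       // String implements Serializable
println(canJavaSerialize(new RecordWrapper("x"))) // fails, just like AvroWrapper in a shuffle
```

This is why switching Spark to Kryo (which does not require java.io.Serializable) fixes the error.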
import org.apache.spark.SparkConf

val cfg = new SparkConf().setAppName("MySparkJob")
// Use Kryo instead of default Java serialization, and plug in a custom registrator
cfg.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
cfg.set("spark.kryo.registrator", "com.stackoverflow.Registrator")
However, that wasn't enough, as my class, which was in the Avro file, didn't implement Serializable either. Therefore I added my own registrator extending KryoRegistrator, and included the chill-avro library.
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.avro.AvroSerializer
import org.apache.spark.serializer.KryoRegistrator

class Registrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register each Avro-generated class with chill-avro's specific-record serializer
    kryo.register(classOf[MyClassInAvroFile], AvroSerializer.SpecificRecordBinarySerializer[MyClassInAvroFile])
    kryo.register(classOf[AnotherClassInAvroFile], AvroSerializer.SpecificRecordBinarySerializer[AnotherClassInAvroFile])
  }
}
Then I was able to read a file like this:
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

ctx.hadoopFile("/path/to/the/avro/file.avro",
  classOf[AvroInputFormat[MyClassInAvroFile]],
  classOf[AvroWrapper[MyClassInAvroFile]],
  classOf[NullWritable]
).map(_._1.datum())  // unwrap each AvroWrapper to get the actual record
Setting the serializer to Kryo should do the trick. One way is to add (or uncomment) this line in /etc/spark/conf/spark-defaults.conf:

spark.serializer org.apache.spark.serializer.KryoSerializer
My solution was to use Spark 1.2 with sparkSQL, as linked in my question:
val person = sqlContext.avroFile("/tmp/person.avro")