
spark scala avro write fails with AbstractMethodError

I'm trying to read data from an Avro file, repartition it by a field, and save it back in Avro format. Below is my sample code. While debugging, I found that I cannot even do a show(10) on my DataFrame; it fails with the error below. Can someone please help me understand what I'm doing wrong?

Code:

import org.apache.spark.sql.avro._

val df = spark.read.format("avro").load("s3://test-bucket/source.avro")

df.show(10)
df.write.partitionBy("partitioning_column").format("avro").save("s3://test-bucket/processed/processed.avro")

Both show and write fail with the following error:

java.lang.AbstractMethodError: org.apache.spark.sql.avro.AvroFileFormat.shouldPrefetchData(Lorg/apache/spark/sql/SparkSession;Lorg/apache/spark/sql/types/StructType;Lorg/apache/spark/sql/types/StructType;)Z
  at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:309)
  at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:305)
  at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:404)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:283)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:375)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
  ... 85 elided

This is caused by an unintentionally binary-incompatible change to FileFormat in emr-5.28.0, which will be fixed when emr-5.29.0 comes out. Fortunately, for the Avro format there is an easy workaround on emr-5.28.0: instead of using the version of spark-avro from Maven Central, use the spark-avro jar bundled with EMR. That is, instead of something like --packages org.apache.spark:spark-avro_2.11:2.4.4, use --jars /usr/lib/spark/external/lib/spark-avro.jar.
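In practice the change is just a flag on the spark-shell or spark-submit invocation. A sketch, using the EMR path given above (the exact jar location may vary by EMR release):

```shell
# Fails on emr-5.28.0: pulls spark-avro 2.4.4 from Maven Central,
# which is binary-incompatible with EMR's patched FileFormat:
# spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.4

# Workaround: use the spark-avro jar bundled with EMR instead.
spark-shell --jars /usr/lib/spark/external/lib/spark-avro.jar
```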

spark-avro for Spark 2.4.4 with Scala 2.11.12 appears to be buggy. Downgrading to Spark 2.4.3 with Scala 2.11.12 works just fine.

This drove me a bit crazy, and I couldn't get help from AWS. The latest version of Spark, 2.4.4, definitely has issues with Avro. Downgrading to 2.4.3 fixed the issues I was having.

The above issue is due to a compatibility mismatch between the Spark and spark-avro jars. Use matching versions of Spark and spark-avro from Maven Central.

The spark-avro package is only available from Spark 2.4.0 onwards. Check your Spark version in your pom.xml or build.sbt.
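As a sketch, a hypothetical build.sbt that pins Spark and spark-avro to the same release (2.4.3, which other answers here report as working):

```scala
// build.sbt -- keep Spark and spark-avro on the same version
scalaVersion := "2.11.12"

val sparkVersion = "2.4.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql"  % sparkVersion % Provided,
  // spark-avro exists only for Spark >= 2.4.0
  "org.apache.spark" %% "spark-avro" % sparkVersion
)
```

Keeping the two artifacts on one shared version variable avoids the binary-incompatibility errors described above.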

The following link provides information about Spark Avro binary in maven central: https://mvnrepository.com/artifact/org.apache.spark/spark-avro
