
spark scala avro write fails with AbstractMethodError

I'm trying to read data from an Avro file, repartition it by a field, and save it back in Avro format. Below is my sample code. While debugging, I found that I cannot even do a show(10) on my DataFrame; it fails with the error below. Can someone please help me understand what I'm doing wrong?

Code:

import org.apache.spark.sql.avro._

val df = spark.read.format("avro").load("s3://test-bucket/source.avro")

df.show(10)
df.write.partitionBy("partitioning_column").format("avro").save("s3://test-bucket/processed/processed.avro")

Both show and write fail with the following error:

java.lang.AbstractMethodError: org.apache.spark.sql.avro.AvroFileFormat.shouldPrefetchData(Lorg/apache/spark/sql/SparkSession;Lorg/apache/spark/sql/types/StructType;Lorg/apache/spark/sql/types/StructType;)Z
  at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:309)
  at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:305)
  at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:404)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:283)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:375)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
  ... 85 elided

This is caused by an unintentionally binary-incompatible change to FileFormat in emr-5.28.0, which will be fixed when emr-5.29.0 comes out. Fortunately, for the Avro format there is an easy workaround on emr-5.28.0: instead of using the version of spark-avro from Maven Central, use the spark-avro jar bundled with EMR. That is, instead of something like --packages org.apache.spark:spark-avro_2.11:2.4.4, use --jars /usr/lib/spark/external/lib/spark-avro.jar.
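In practice the change is just a flag on the spark-shell or spark-submit invocation. A sketch, using the EMR path given above (the exact jar location may vary by EMR release):

```shell
# Fails on emr-5.28.0: pulls spark-avro 2.4.4 from Maven Central,
# which is binary-incompatible with EMR's patched FileFormat:
# spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.4

# Workaround: use the spark-avro jar bundled with EMR instead.
spark-shell --jars /usr/lib/spark/external/lib/spark-avro.jar
```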

spark-avro for Spark 2.4.4 with Scala 2.11.12 appears to be buggy. Downgrading to Spark 2.4.3 with Scala 2.11.12 works just fine.

This drove me a bit crazy, and I couldn't get help from AWS. The latest version of Spark, 2.4.4, definitely has issues with Avro. Downgrading to 2.4.3 fixed the issues I was having.

The above issue is due to a compatibility mismatch between the Spark and spark-avro jars. Use matching versions of Spark and spark-avro from Maven Central.

The spark-avro package is only available from Spark 2.4.0 onwards. Check your Spark version in your pom.xml or build.sbt.
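As a sketch, a hypothetical build.sbt that pins Spark and spark-avro to the same release (2.4.3, which other answers here report as working):

```scala
// build.sbt -- keep Spark and spark-avro on the same version
scalaVersion := "2.11.12"

val sparkVersion = "2.4.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql"  % sparkVersion % Provided,
  // spark-avro exists only for Spark >= 2.4.0
  "org.apache.spark" %% "spark-avro" % sparkVersion
)
```

Keeping the two artifacts on one shared version variable avoids the binary-incompatibility errors described above.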

The following link provides information about Spark Avro binary in maven central: https://mvnrepository.com/artifact/org.apache.spark/spark-avro
