
Strange error while writing Parquet file to S3

While trying to write a dataframe to S3 I am getting the below error with a NullPointerException. Sometimes the job goes through fine and sometimes it fails.

I am using EMR 5.20 and Spark 2.4.0.

Spark session creation:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder
        .config("spark.sql.parquet.binaryAsString", "true")
        .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
        .config("spark.sql.parquet.filterPushdown", "true")
        // EMRFS S3-optimized committer for Parquet output
        .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
        .getOrCreate()

spark.sql("myQuery").write.partitionBy("partitionColumn").mode(SaveMode.Overwrite).option("inferSchema","false").parquet("s3a://...filePath")

Can anyone help resolve this mystery? Thanks in advance.

java.lang.NullPointerException
  at com.amazon.ws.emr.hadoop.fs.s3.lite.S3Errors.isHttp200WithErrorCode(S3Errors.java:57)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:100)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:127)
  at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:364)
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1372)
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:663)
  at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:332)
  at org.apache.spark.internal.io.FileCommitProtocol.deleteWithJob(FileCommitProtocol.scala:124)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:223)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:122)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
  ... 55 elided

You're using SaveMode.Overwrite, and the frame com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:127) in the stack trace indicates the problem occurs during the delete phase of the overwrite.
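
One way to confirm that is to exercise the same delete path outside of Spark's write machinery. Below is a minimal sketch, assuming a throwaway test prefix under your output bucket (the s3a URI is a placeholder); it resolves the scheme through the same Hadoop configuration your write uses, so it should go through the same EMRFS client that appears in the stack trace:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder: point this at a disposable prefix under your output bucket
val testPrefix = "s3a://your-bucket/tmp/delete-check"

val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(new URI(testPrefix), conf)

// Create a small object, then delete the prefix recursively, which is
// roughly what the overwrite's deleteMatchingPartitions step does
val out = fs.create(new Path(testPrefix + "/marker"))
out.writeBytes("test")
out.close()
fs.delete(new Path(testPrefix), true)

If that delete reproduces the NullPointerException on its own, the problem is in the S3 delete call itself rather than anything specific to the Parquet write.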

I would check and make sure the S3 permissions in the IAM policy for your EMR EC2 instance profile allow the s3:DeleteObject action for the file path in your call to write Parquet. It should look something like this:

{
  "Sid": "AllowWriteAccess",
  "Action": [
    "s3:DeleteObject",
    "s3:Get*",
    "s3:List*",
    "s3:PutObject"
  ],
  "Effect": "Allow",
  "Resource": [
    "<arn_for_your_filepath>/*"
  ]
}

Do you use different file paths in your calls to write Parquet between jobs? If so, that would explain the intermittent failures, since only the paths covered by the policy's Resource would be deletable. A quick way to compare the paths is sketched below.
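
This is a minimal sketch, assuming the paths below are placeholders for the output locations your jobs actually use; it runs the same put-and-delete check against each prefix under the cluster's instance profile, so a failure on the delete step for one of them would point to a missing s3:DeleteObject grant for that prefix:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

// Placeholders: replace with the output paths your different jobs write to
val outputPaths = Seq(
  "s3a://your-bucket/warehouse/table_a",
  "s3a://your-bucket/warehouse/table_b"
)

val conf = spark.sparkContext.hadoopConfiguration

outputPaths.foreach { p =>
  val fs = FileSystem.get(new URI(p), conf)
  val marker = new Path(p + "/_permission_check")
  // Exercises s3:PutObject and then s3:DeleteObject for this prefix
  val put = Try { val out = fs.create(marker); out.writeBytes("x"); out.close() }
  val del = Try(fs.delete(marker, false))
  println(s"$p put=${put.isSuccess} delete=${del.isSuccess}")
}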

Looks like a bug in the AWS code. That is closed source, so you will have to take it up with them.

I do see a hint that this is an error in the code that tries to parse error responses. Maybe something has failed, but the client code that parses that error response is buggy. That's not unusual: it's the failure handling that rarely gets enough test coverage.
