
Strange error while writing Parquet file to S3

While trying to write a dataframe to S3 I am getting the below error with a NullPointerException. Sometimes the job goes through fine and sometimes it fails.

I am using EMR 5.20 and Spark 2.4.0.

Spark session creation:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder
        .config("spark.sql.parquet.binaryAsString", "true")
        .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
        .config("spark.sql.parquet.filterPushdown", "true")
        // EMRFS S3-optimized committer for Parquet output
        .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
        .getOrCreate()

spark.sql("myQuery").write.partitionBy("partitionColumn").mode(SaveMode.Overwrite).option("inferSchema","false").parquet("s3a://...filePath")

Can anyone help resolve this mystery? Thanks in advance.

java.lang.NullPointerException
  at com.amazon.ws.emr.hadoop.fs.s3.lite.S3Errors.isHttp200WithErrorCode(S3Errors.java:57)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:100)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:127)
  at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:364)
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1372)
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:663)
  at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:332)
  at org.apache.spark.internal.io.FileCommitProtocol.deleteWithJob(FileCommitProtocol.scala:124)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:223)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:122)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
  ... 55 elided

You're using SaveMode.Overwrite, and the frame com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:127) in the stack trace indicates the problem occurs during the delete phase of the overwrite.
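
One way to confirm that is to exercise the same delete path outside of Spark's write machinery. Below is a minimal sketch, assuming a throwaway test prefix under your output bucket (the s3a URI is a placeholder); it resolves the scheme through the same Hadoop configuration your write uses, so it should go through the same EMRFS client that appears in the stack trace:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder: point this at a disposable prefix under your output bucket
val testPrefix = "s3a://your-bucket/tmp/delete-check"

val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(new URI(testPrefix), conf)

// Create a small object, then delete the prefix recursively, which is
// roughly what the overwrite's deleteMatchingPartitions step does
val out = fs.create(new Path(testPrefix + "/marker"))
out.writeBytes("test")
out.close()
fs.delete(new Path(testPrefix), true)

If that delete reproduces the NullPointerException on its own, the problem is in the S3 delete call itself rather than anything specific to the Parquet write.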

I would check and make sure the S3 permissions in the IAM policy for your EMR EC2 instance profile allow the s3:DeleteObject action for the file path in your call to write Parquet. It should look something like this:

{
  "Sid": "AllowWriteAccess",
  "Action": [
    "s3:DeleteObject",
    "s3:Get*",
    "s3:List*",
    "s3:PutObject"
  ],
  "Effect": "Allow",
  "Resource": [
    "<arn_for_your_filepath>/*"
  ]
}

Do you use different file paths in your calls to write Parquet between jobs? If so, that would explain the intermittent failures, since only the paths covered by the policy's Resource would be deletable. A quick way to compare the paths is sketched below.
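
This is a minimal sketch, assuming the paths below are placeholders for the output locations your jobs actually use; it runs the same put-and-delete check against each prefix under the cluster's instance profile, so a failure on the delete step for one of them would point to a missing s3:DeleteObject grant for that prefix:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

// Placeholders: replace with the output paths your different jobs write to
val outputPaths = Seq(
  "s3a://your-bucket/warehouse/table_a",
  "s3a://your-bucket/warehouse/table_b"
)

val conf = spark.sparkContext.hadoopConfiguration

outputPaths.foreach { p =>
  val fs = FileSystem.get(new URI(p), conf)
  val marker = new Path(p + "/_permission_check")
  // Exercises s3:PutObject and then s3:DeleteObject for this prefix
  val put = Try { val out = fs.create(marker); out.writeBytes("x"); out.close() }
  val del = Try(fs.delete(marker, false))
  println(s"$p put=${put.isSuccess} delete=${del.isSuccess}")
}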

Looks like a bug in the AWS code. That is closed source, so you will have to take it up with them.

I do see a hint that this is an error in the code that tries to parse error responses. Maybe something has failed, but the client code that parses that error response is buggy. That's not unusual: it's the failure handling that rarely gets enough test coverage.
