Writing Spark dataframe as parquet to S3 without creating a _temporary folder
Using pyspark I'm reading a dataframe from parquet files on Amazon S3 like
dataS3 = sql.read.parquet("s3a://" + s3_bucket_in)
This works without problems. But then I try to write the data
dataS3.write.parquet("s3a://" + s3_bucket_out)
I do get the following exception
py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.IllegalArgumentException: java.net.URISyntaxException:
Relative path in absolute URI: s3a://<s3_bucket_out>_temporary
It seems to me that Spark is trying to create a _temporary folder first, before writing into the given bucket. Can this be prevented somehow, so that Spark writes directly to the given output bucket?
You can't eliminate the _temporary folder, as that's used to keep the intermediate work of a query hidden until it's complete.
But that's OK, as this isn't the problem. The problem is that the output committer gets a bit confused trying to write to the root directory (it can't delete it).
You need to write to a subdirectory under a bucket, with a full prefix, e.g.
s3a://mybucket/work/out
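For example, a minimal sketch applied to the question's code (the "output/parquet" prefix is just an illustrative name):

# The failing call targeted the bucket root:
# dataS3.write.parquet("s3a://" + s3_bucket_out)
# Adding a key prefix gives the committer a directory it can create and clean up:
dataS3.write.parquet("s3a://" + s3_bucket_out + "/output/parquet")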
I should add that trying to commit data to S3A is not reliable, precisely because of the way it mimics rename() with something like ls -rlf src | xargs -p8 -I% "cp % dst/% && rm %". Because ls has delayed consistency on S3, it can miss newly created files and so fail to copy them.
See: Improving Apache Spark for the details.
Right now, you can only reliably commit to s3a by writing to HDFS and then copying. EMR's S3 works around this by using DynamoDB to offer a consistent listing.
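As a rough sketch of the HDFS-then-copy approach (the paths are illustrative, and hadoop distcp is assumed to be available on the cluster; s3-dist-cp is the EMR equivalent):

import subprocess

# Write (and commit) on HDFS first, where rename() is atomic and listings are consistent.
hdfs_out = "hdfs:///user/spark/staging/out"
dataS3.write.mode("overwrite").parquet(hdfs_out)

# Then copy the completed output to S3 in a separate step.
subprocess.check_call(["hadoop", "distcp", hdfs_out, "s3a://mybucket/work/out"])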
I had the same issue when writing to the root of the S3 bucket:
df.save("s3://bucketname")
I resolved it by adding a / after the bucket name:
df.save("s3://bucketname/")