Writing Spark dataframe as parquet to S3 without creating a _temporary folder
Using pyspark I'm reading a dataframe from parquet files on Amazon S3 like
dataS3 = sql.read.parquet("s3a://" + s3_bucket_in)
This works without problems. But then I try to write the data
dataS3.write.parquet("s3a://" + s3_bucket_out)
I do get the following exception
py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.IllegalArgumentException: java.net.URISyntaxException:
Relative path in absolute URI: s3a://<s3_bucket_out>_temporary
It seems to me that Spark is trying to create a _temporary folder first, before writing into the given bucket. Can this be prevented somehow, so that Spark writes directly to the given output bucket?
You can't eliminate the _temporary folder, as that's used to keep the intermediate work of a query hidden until it's complete.
But that's OK, as this isn't the problem. The problem is that the output committer gets a bit confused trying to write to the root directory (it can't delete it).
You need to write to a subdirectory under a bucket, with a full prefix, e.g.
s3a://mybucket/work/out
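For example, a minimal sketch applied to the question's code (the "output/parquet" prefix is just an illustrative name):

# The failing call targeted the bucket root:
# dataS3.write.parquet("s3a://" + s3_bucket_out)
# Adding a key prefix gives the committer a directory it can create and clean up:
dataS3.write.parquet("s3a://" + s3_bucket_out + "/output/parquet")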
I should add that trying to commit data to S3A is not reliable, precisely because of the way it mimics rename() with something like ls -rlf src | xargs -p8 -I% "cp % dst/% && rm %". Because ls has delayed consistency on S3, it can miss newly created files and so fail to copy them.
See: Improving Apache Spark for the details.
Right now, you can only reliably commit to s3a by writing to HDFS and then copying. EMR's S3 works around this by using DynamoDB to offer a consistent listing.
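As a rough sketch of the HDFS-then-copy approach (the paths are illustrative, and hadoop distcp is assumed to be available on the cluster; s3-dist-cp is the EMR equivalent):

import subprocess

# Write (and commit) on HDFS first, where rename() is atomic and listings are consistent.
hdfs_out = "hdfs:///user/spark/staging/out"
dataS3.write.mode("overwrite").parquet(hdfs_out)

# Then copy the completed output to S3 in a separate step.
subprocess.check_call(["hadoop", "distcp", hdfs_out, "s3a://mybucket/work/out"])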
I had the same issue when writing to the root of the S3 bucket:
df.save("s3://bucketname")
I resolved it by adding a / after the bucket name:
df.save("s3://bucketname/")