
Writing Spark dataframe as parquet to S3 without creating a _temporary folder

Using pyspark I'm reading a dataframe from parquet files on Amazon S3 like this:

dataS3 = sql.read.parquet("s3a://" + s3_bucket_in)

This works without problems. But when I then try to write the data

dataS3.write.parquet("s3a://" + s3_bucket_out)

I get the following exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
Relative path in absolute URI: s3a://<s3_bucket_out>_temporary

It seems to me that Spark is trying to create a _temporary folder first, before writing into the given bucket. Can this be prevented somehow, so that Spark writes directly to the given output bucket?

You can't eliminate the _temporary file, as that's used to keep the intermediate work of a query hidden until it's complete.

But that's OK, as this isn't the problem. The problem is that the output committer gets a bit confused trying to write to the root directory (it can't delete it, you see).

You need to write to a subdirectory under a bucket, with a full prefix. e.g. s3a://mybucket/work/out .
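
For example, a minimal sketch using the dataframe and bucket variable from the question (the "output/data" suffix is just an illustrative path, not anything prescribed):

dataS3.write.parquet("s3a://" + s3_bucket_out + "/output/data")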

I should add that trying to commit data to S3A is not reliable, precisely because of the way it mimics rename() with something like ls -rlf src | xargs -p8 -I% "cp % dst/% && rm %" . Because ls has delayed consistency on S3, it can miss newly created files, and so not copy them.

See: Improving Apache Spark for the details.

Right now, you can only reliably commit to s3a by writing to HDFS and then copying. EMR S3 works around this by using DynamoDB to offer a consistent listing.
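
As a rough sketch of that workaround (the hdfs:///tmp/staging/out path and the final distcp step are assumptions for illustration, not part of the original answer):

# write the result to HDFS first, so the commit/rename happens on a real filesystem
dataS3.write.parquet("hdfs:///tmp/staging/out")

# then copy the completed output to S3 in a separate step, e.g. with distcp:
#   hadoop distcp hdfs:///tmp/staging/out s3a://mybucket/work/out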

I had the same issue when writing to the root of an S3 bucket:

df.save("s3://bucketname")

I resolved it by adding a / after the bucket name:

df.save("s3://bucketname/")
