Unable to create partition on S3 using spark

I would like to use this new functionality: overwrite a specific partition without deleting all the data in S3.

I used the new flag (spark.sql.sources.partitionOverwriteMode="dynamic") and tested it locally from my IDE, and it worked (I was able to overwrite a specific partition in S3). But when I deployed it to HDP 2.6.5 with Spark 2.3.0, the same code did not create the S3 folders as expected: no folder was created at all, only a temp folder.

My code:

df.write
.mode(SaveMode.Overwrite)
.partitionBy("day","hour")
.option("compression", "gzip")
.parquet(s3Path)
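
For context, a minimal sketch of how the flag was set for the local run (the builder settings below are assumptions, not taken from the original job):

// Assumed setup: enable dynamic partition overwrite on the SparkSession
// before calling df.write, so only the partitions present in df are replaced
// and the rest of s3Path is left untouched (Spark 2.3+).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-partition-overwrite")
  .getOrCreate()

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")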

Have you tried Spark version 2.4? I have worked with this version on both EMR and Glue and it has worked well. To use the "dynamic" mode in version 2.4, just use this code:

dataset.write.mode("overwrite")
.option("partitionOverwriteMode", "dynamic")
.partitionBy("dt")
.parquet("s3://bucket/output")

AWS documentation specifies Spark version 2.3.2 to use spark.sql.sources.partitionOverwriteMode="dynamic".

Reference: click here.
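
Since the local run and the HDP 2.6.5 / Spark 2.3.0 cluster behave differently, it may also help to confirm what the deployed job actually sees at runtime; a small check, assuming an existing SparkSession named spark:

// Print the Spark version actually running on the cluster and the effective
// value of the flag (falls back to "not set" if the key is absent).
println(spark.version)
println(spark.conf.get("spark.sql.sources.partitionOverwriteMode", "not set"))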
