How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

Question

I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty "$folder$" objects that Spark generates remain in S3. They do not look nice in the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide or remove these folders after the job completes successfully?

[Image: S3 screenshot]

Answer 1

OK, finally, after a few days of testing I found the solution. Before pasting the code, let me summarize what I have found:

Those $folder$ objects are created by Hadoop. Apache Hadoop creates these files when it creates a folder in an S3 bucket (Source 1). They are actually directory markers of the form path + / (Source 2).
To change the behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this and this and this.
Read about S3, S3a and S3n here and here.
Thanks to @stevel's comment here.
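To make the marker naming concrete, here is a minimal sketch that separates marker keys from data keys in a list of S3 object keys. It assumes the legacy markers end in `_$folder$` and the slash-style markers end in `/`; the function names and the sample keys are hypothetical, and listing the bucket is left out.

```python
# Hypothetical helpers for illustration; not part of Spark, Glue, or Hadoop.
LEGACY_FOLDER_SUFFIX = "_$folder$"  # assumed suffix of the legacy marker objects


def is_directory_marker(key: str) -> bool:
    """Return True if the key looks like a directory marker rather than data."""
    return key.endswith(LEGACY_FOLDER_SUFFIX) or key.endswith("/")


def split_markers(keys):
    """Split keys into (markers, data) while preserving the input order."""
    markers = [k for k in keys if is_directory_marker(k)]
    data = [k for k in keys if not is_directory_marker(k)]
    return markers, data


# Made-up sample keys, only to show the call shape.
sample_keys = [
    "YY_$folder$",
    "YY/DDD=1_$folder$",
    "YY/DDD=1/part-00000.parquet",
    "YY/DDD=2/",
]
markers, data = split_markers(sample_keys)
```

A list like `markers` could then be reviewed before deciding whether any cleanup is wanted; the sketch itself deletes nothing.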

The solution is to set the following option in the Hadoop configuration of the Spark context:

from pyspark import SparkContext

sc = SparkContext()
# Hadoop configuration backing this Spark context
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

To avoid creating the _SUCCESS files, you need to set the following configuration as well: hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
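Both settings can be kept in one place and applied together. The following is a minimal sketch, not a Glue or Spark API: the helper name `apply_s3_output_settings` and the `S3_OUTPUT_SETTINGS` dictionary are hypothetical, and the helper only assumes that the object it receives has a `set(key, value)` method, as the Hadoop configuration object does.

```python
# Hypothetical helper for illustration; the names are not part of Glue or Spark.
S3_OUTPUT_SETTINGS = {
    # Route s3:// URIs through the S3A connector (setting from the answer).
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    # Do not write the _SUCCESS marker file (setting from the answer).
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def apply_s3_output_settings(hadoop_conf, settings=S3_OUTPUT_SETTINGS):
    """Call hadoop_conf.set(key, value) for each setting; return what was applied."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)
    return dict(settings)
```

In a job this would be called once, before any write, as `apply_s3_output_settings(sc._jsc.hadoopConfiguration())`.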

Make sure you use the s3:// URI when writing to the S3 bucket, for example:

myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])

1 answer

solution1
7 ACCEPTED 2021-01-15 11:43:46
