
How do I delete a parquet file in spark scala?

I'm writing to a parquet file in databricks, but before doing so, I'd like to delete the old version of it.

Here's my write line:

report.coalesce(1).write.mode("append").partitionBy("Name").parquet(s"s3://${reportBucket}/reports/dashboard")

I cannot figure out how to check for existence of this file, and delete if it exists.

Here is some pseudo code of the class and the lines that call it. I'm trying to check whether the output file exists and, if it does, remove it. Once it's removed, the class runs twice and appends the results to the parquet file. However, it must be deleted only after run2, not run1.

    class WriteReport(val run: String = "run1") {
      val report = spark.read.parquet(s"blablah")
      report.createOrReplaceTempView("report")

      val dashboard = spark.sql("""
        SELECT name AS Name FROM Table
      """)

      report.coalesce(1).write.mode("append").partitionBy("Name").parquet(s"s3://${reportBucket}/reports/dashboard")
    }

    val n_b = new WriteReport("run1")
    val n_g = new WriteReport("run2")
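The existence check and delete I have in mind might look something like the sketch below, using Hadoop's FileSystem API (which ships with Spark, so it works against `s3://` paths on Databricks). The bucket and path are placeholders, and `spark` is assumed to be the active SparkSession:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

// Delete the directory (recursively) if it exists; returns true if something was deleted.
// Uses the cluster's Hadoop configuration so the right S3 filesystem is resolved.
def deleteIfExists(spark: org.apache.spark.sql.SparkSession, dir: String): Boolean = {
  val path = new Path(dir)
  val fs   = FileSystem.get(new URI(dir), spark.sparkContext.hadoopConfiguration)
  if (fs.exists(path)) fs.delete(path, true) else false
}

// Hypothetical usage before the write:
// deleteIfExists(spark, s"s3://${reportBucket}/reports/dashboard")
```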

Spark itself doesn't support deleting from S3; you can only work within the framework, and anything outside it requires your own logic built outside of Spark. Before triggering EMR, you can configure a Lambda to clean your directory and, once the delete is confirmed, trigger the Spark job.

Removing a local directory before recreating the file:

    import scala.reflect.io.Directory
    import java.io.File

    // Recursively delete the directory and everything under it
    val dir = new Directory(new File("/yourDirectory"))
    dir.deleteRecursively()

Removing a directory in AWS S3 before recreating the file:

AWS S3 provides read-after-write consistency for PUTs of new objects, but only eventual consistency for overwrite PUTs and DELETEs, so a delete does not take effect immediately and may take some time to propagate.

So if you delete with S3 commands and launch a new job at the same time, you may run into issues; instead, build logic that writes to a separate directory on each run and deletes the old one once the work is done.
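One way to sketch that "separate directory per run" idea is to suffix the output path with a timestamp, so a new job never races a pending delete. The bucket name and prefix below are placeholders:

```scala
import java.time.format.DateTimeFormatter
import java.time.{ZoneOffset, ZonedDateTime}

// Build a unique output directory for this run, e.g.
// s3://mybucket/reports/dashboard/20200102-030405
def runOutputDir(bucket: String, prefix: String, now: ZonedDateTime): String = {
  val stamp = now.format(DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss"))
  s"s3://$bucket/$prefix/$stamp"
}

// Hypothetical usage: write to the fresh directory, delete the old one afterwards.
// val dir = runOutputDir(reportBucket, "reports/dashboard", ZonedDateTime.now(ZoneOffset.UTC))
// report.coalesce(1).write.mode("append").partitionBy("Name").parquet(dir)
```

Downstream readers then pick up the latest timestamped directory, and the old one can be deleted at leisure once the new write has finished.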

Java:

You need to use the AWS SDK to delete objects, since Spark has no command to delete from S3:

    // Delete every object under the given prefix, paging through the listing
    if (s3Client.doesBucketExist(bucketName)) {
        ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
                .withBucketName(bucketName)
                .withPrefix("foo/bar/baz");

        ObjectListing objectListing = s3Client.listObjects(listObjectsRequest);

        while (true) {
            for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
                s3Client.deleteObject(bucketName, objectSummary.getKey());
            }
            if (objectListing.isTruncated()) {
                objectListing = s3Client.listNextBatchOfObjects(objectListing);
            } else {
                break;
            }
        }
    }

Boto3 (Python S3 SDK):

    import boto3

    client = boto3.client('s3')
    client.delete_object(Bucket='mybucketname', Key='myfile.whatever')
