S3 bucket is getting deleted while writing parquet files via PySpark Glue job
I have developed a PySpark Glue job for loading the complete/incremental dataset. It is working fine. After loading the dataset I have to perform a few aggregations and write the result in "overwrite"/"append" mode to a single location. For this, I have written the below code:
maxDateValuePath = "s3://...../maxValue/"
outputPath = "s3://..../complete-load/"
aggregatedPath = "s3://...../aggregated-output/"
fullLoad = ""
aggregatedView = ""
completeAggregatedPath = "s3://...../aggregated-output/step=complete-load/"
incrAggregatedPath = "s3://....../aggregated-output/step=incremental-load/"

data.createOrReplaceTempView("data")
aggregatedView = spark.sql("""
    select catid, count(*) as number_of_catids from data
    group by catid""")

if incrementalLoad == str(0):
    aggregatedView = aggregatedView.withColumn("created_at", current_timestamp())
    aggregatedView.write.mode("overwrite").parquet(completeAggregatedPath)
elif incrementalLoad == str(1):
    aggregatedView = aggregatedView.withColumn("created_at", current_timestamp())
    log.info("step 123: " + str(aggregatedView.count()))
    aggregatedView.write.mode("append").parquet(completeAggregatedPath)
    aggregatedView = spark.read.parquet(completeAggregatedPath)
    log.info("step 126: " + str(aggregatedView.count()))
    w = Window.partitionBy("catid").orderBy(col("created_at").desc())
    aggregatedView = aggregatedView.withColumn("rw", row_number().over(w)).filter(col("rw") == lit(1)).drop("rw")
    log.info("step 130: " + str(aggregatedView.count()))
    log.info(aggregatedView.orderBy(col("created_at").desc()).show())
    print("::::::::::::before writing::::::::::::::")
    aggregatedView.write.mode("overwrite").parquet(incrAggregatedPath)
Where 0 and 1 stand for full load/incremental load respectively. Now, before writing the transformed dataset I add a created_at column so that I can pick the latest aggregated records after writing the incremental dataset, or else it leads to duplicates.
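The "keep the latest record per catid" step can be pictured with a plain-Python sketch (the sample rows and timestamps below are hypothetical, purely for illustration). It mirrors what row_number() over Window.partitionBy("catid").orderBy(col("created_at").desc()) followed by filter(col("rw") == lit(1)) does in the PySpark code:

```python
# Hypothetical sample rows standing in for the aggregated DataFrame.
rows = [
    {"catid": 1, "number_of_catids": 10, "created_at": "2021-01-01"},
    {"catid": 1, "number_of_catids": 12, "created_at": "2021-02-01"},
    {"catid": 2, "number_of_catids": 5,  "created_at": "2021-01-15"},
]

# Keep only the newest record per catid, the same effect as the
# row_number().over(window) == 1 filter in the Spark version.
latest = {}
for r in rows:
    cur = latest.get(r["catid"])
    if cur is None or r["created_at"] > cur["created_at"]:
        latest[r["catid"]] = r

deduped = sorted(latest.values(), key=lambda r: r["catid"])
```

Without this step, every incremental append would leave one stale row per catid behind, which is exactly the duplication the created_at column is there to resolve.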
Everything is working as expected, but the problem is that when I try to write the dataset in overwrite mode in the incremental part, using the line aggregatedView.write.mode("overwrite").parquet(aggregatedPath), the bucket gets deleted in S3 and the operation results in the below error:
Caused by: java.io.FileNotFoundException: File not present on S3
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Why is the bucket getting deleted?
The problem is in your code. You are reading from and writing to the same location in the incremental section.
aggregatedView = spark.read.parquet(aggregatedPath)
...
...
aggregatedView.write.mode("overwrite").parquet(aggregatedPath)
Since Spark does lazy evaluation, what happens is that when you specify the mode as overwrite, it first clears the data in that particular folder, which leaves you with nothing to read. Only when execution reaches the write part of the code does Spark actually start reading the data, by which time the data has already been cleared by your write action.
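The failure mode can be reproduced in plain Python (this is only an analogy, not Spark itself): a lazy generator over a file, consumed only after the file has been overwritten, behaves just like a DataFrame read that is not triggered until the write action runs.

```python
import os
import tempfile

# Set up a small source file (hypothetical temp path, standing in for S3).
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("a\nb\nc\n")

def lazy_read(p):
    # Nothing is read here until the generator is consumed,
    # just like spark.read plus a chain of transformations.
    with open(p) as f:
        for line in f:
            yield line.strip()

rows = lazy_read(path)     # the "read" is declared, not executed
open(path, "w").close()    # the "overwrite" truncates the source first
result = list(rows)        # only now does the read run -- the data is gone
print(result)              # -> []
```

The overwrite clears the source before the lazy read ever executes, which is exactly why Spark then reports FileNotFoundException on the parquet files it expected to find.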
So, I fixed my problem by changing the below line of my code:
aggregatedView2 = spark.read.parquet(completeAggregatedPath)
With the original code, there was a single df lineage for the aggregated view. Since the read and the write were being performed on the same S3 location and on the same df lineage, it was deleting the prefix, because the source data for the df was ambiguous. Therefore, I created a new df that reads from the S3 location rather than depending on the previous transformations.
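In the same plain-Python analogy (hypothetical temp paths, not Spark itself), the fix amounts to making the data independent of the source location before any overwrite touches it, roughly what reading into a separate df achieves:

```python
import os
import tempfile

# Source file standing in for the S3 prefix (hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("a\nb\nc\n")

# Eager, fully materialized read: an independent copy of the data,
# analogous to a fresh DataFrame lineage that no longer points at
# the location about to be overwritten.
with open(path) as f:
    snapshot = [line.strip() for line in f]

# Now the overwrite is safe: the transformed output no longer
# depends on the bytes it is replacing.
with open(path, "w") as f:
    f.write("\n".join(x.upper() for x in snapshot))

with open(path) as f:
    result = f.read().splitlines()
print(result)   # -> ['A', 'B', 'C']
```

The general rule carries over to Spark: never let a write.mode("overwrite") target a path that the DataFrame being written still depends on for its input.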
Maybe it will help someone!