
S3 bucket is getting deleted while writing parquet files via PySpark Glue job

I have developed a PySpark Glue job for loading the complete/incremental dataset, and it is working fine. After loading the dataset I have to perform a few aggregations and write the result in "overwrite"/"append" mode to a single location. For this, I have written the code below:

    from pyspark.sql.functions import current_timestamp, col, row_number, lit
    from pyspark.sql.window import Window

    # `spark`, `data`, `incrementalLoad` and `log` are assumed to be defined
    # earlier in the Glue job (dataset load and setup not shown here).
    maxDateValuePath = "s3://...../maxValue/"
    outputPath = "s3://..../complete-load/"
    aggregatedPath = "s3://...../aggregated-output/"
    fullLoad = ""
    completeAggregatedPath = "s3://...../aggregated-output/step=complete-load/"
    incrAggregatedPath = "s3://....../aggregated-output/step=incremental-load/"

    # Aggregate the loaded dataset: number of rows per catid
    data.createOrReplaceTempView("data")
    aggregatedView = spark.sql("""
        select catid, count(*) as number_of_catids from data
        group by catid""")

    if (incrementalLoad == str(0)):
        # Full load: stamp the rows and overwrite the complete-load prefix
        aggregatedView = aggregatedView.withColumn("created_at", current_timestamp())
        aggregatedView.write.mode("overwrite").parquet(completeAggregatedPath)
    elif (incrementalLoad == str(1)):
        # Incremental load: append the new aggregates, read everything back,
        # then keep only the latest created_at per catid
        aggregatedView = aggregatedView.withColumn("created_at", current_timestamp())
        log.info("step 123: " + str(aggregatedView.count()))
        aggregatedView.write.mode("append").parquet(completeAggregatedPath)
        aggregatedView = spark.read.parquet(completeAggregatedPath)
        log.info("step 126: " + str(aggregatedView.count()))
        w = Window.partitionBy("catid").orderBy(col("created_at").desc())
        aggregatedView = aggregatedView.withColumn("rw", row_number().over(w)).filter(col("rw") == lit(1)).drop("rw")
        log.info("step 130: " + str(aggregatedView.count()))
        log.info(aggregatedView.orderBy(col("created_at").desc()).show())
        print("::::::::::::before writing::::::::::::::")
        aggregatedView.write.mode("overwrite").parquet(incrAggregatedPath)

Where 0 and 1 stand for full load/incremental load. Now, before writing the transformed dataset I add a created_at column so that the latest aggregated records can be picked out after the incremental dataset has been written; otherwise the append leads to duplicates.

Everything is working as expected, but the problem is that when I try to write the dataset in overwrite mode in the incremental part, using the line aggregatedView.write.mode("overwrite").parquet(aggregatedPath), the bucket gets deleted in S3 and the operation results in the error below:

    Caused by: java.io.FileNotFoundException: File not present on S3
    It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

Why is the bucket getting deleted?

The problem is in your code. You are reading from and writing to the same location in the incremental section.

    aggregatedView = spark.read.parquet(aggregatedPath)
    ...
    ...
    aggregatedView.write.mode("overwrite").parquet(aggregatedPath)

Since Spark does lazy evaluation, what happens is that when you specify the mode as overwrite, it clears the data in that folder first, which leaves you with nothing to read. Only when execution reaches the write part of the code does it actually start to read the data, by which time the data has already been cleared by your write action.
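
As a minimal sketch of the difference (the bucket and prefix names here are hypothetical, and the staging-prefix workaround is just one common way to break the read/overwrite cycle, not necessarily the fix applied below):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("overwrite-sketch").getOrCreate()

    src_path = "s3://some-bucket/aggregated-output/"          # hypothetical source prefix
    staging_path = "s3://some-bucket/aggregated-output-tmp/"  # hypothetical staging prefix

    # Anti-pattern: df is lazy, so nothing has been read yet when the write starts.
    # mode("overwrite") clears src_path first, and the read that the write then
    # triggers fails with FileNotFoundException.
    #
    #   df = spark.read.parquet(src_path)
    #   df.write.mode("overwrite").parquet(src_path)

    # Safer: materialize to a different prefix first, then overwrite the source
    # from that materialized copy, whose lineage starts at staging_path.
    df = spark.read.parquet(src_path)
    df.write.mode("overwrite").parquet(staging_path)
    spark.read.parquet(staging_path).write.mode("overwrite").parquet(src_path)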

So, I fixed my problem by changing the line below in my code:

    aggregatedView2 = spark.read.parquet(completeAggregatedPath)

So for the aggregated view there is now a separate DataFrame lineage. Since the read and the write were being performed on the same S3 location and within the same DataFrame lineage, the write was deleting the prefix because the source data for the DataFrame was ambiguous. Therefore I created a new DataFrame, which looks at the S3 location rather than at the previous transformations.
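
Applied to the incremental branch above, the corrected flow might look roughly like this (a sketch: aggregatedView2 comes from the fix, deduped is just an illustrative name, and the other identifiers are taken from the original code):

    # Append the newly aggregated rows to the complete-load prefix
    aggregatedView.write.mode("append").parquet(completeAggregatedPath)

    # Read back through a NEW DataFrame, so its lineage starts at the S3 files
    # rather than at the transformations that produced the appended data
    aggregatedView2 = spark.read.parquet(completeAggregatedPath)

    # Keep only the latest created_at per catid
    w = Window.partitionBy("catid").orderBy(col("created_at").desc())
    deduped = aggregatedView2.withColumn("rw", row_number().over(w)).filter(col("rw") == lit(1)).drop("rw")

    # Write to a prefix other than the one that was just read
    deduped.write.mode("overwrite").parquet(incrAggregatedPath)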

Maybe it will help someone!

