
Overwrite csv file on s3 fails in pyspark

I load data into a PySpark DataFrame from an S3 bucket, make some manipulations (join, union), and then try to overwrite the same path ('data/csv/') that I read from before. I'm getting this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o4635.save.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in stage 120.0 failed 4 times, most recent failure: Lost task 200.3 in stage 120.0: java.io.FileNotFoundException: Key 'data/csv/part-00000-68ea927d-1451-4a84-acc7-b91e94d0c6a3-c000.csv' does not exist in S3
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
csv_a = spark \
    .read \
    .format('csv') \
    .option("header", "true") \
    .load('s3n://mybucket/data/csv') \
    .where('some condition')

csv_b = spark \
    .read \
    .format('csv') \
    .option("header", "true") \
    .load('s3n://mybucket/data/model/csv') \
    .alias('csv')

# Reading glue categories data
cc = spark \
    .sql("select * from mydatabase.mytable where month='06'") \
    .alias('cc')

# Joining and Union
output = csv_b \
    .join(cc, (csv_b.key == cc.key), 'inner') \
    .select('csv.key', 'csv.created_ts', 'cc.name', 'csv.text') \
    .drop_duplicates(['key']) \
    .union(csv_a) \
    .orderBy('name') \
    .coalesce(1) \
    .write \
    .format('csv') \
    .option('header', 'true') \
    .mode('overwrite') \
    .save('s3n://mybucket/data/csv')

I need to read data from an S3 location, join and union it with other data, and finally overwrite the initial path so that only one csv file with the clean, joined data remains.

If I read (load) the data from a different S3 path than the one I need to overwrite, it works and the overwrite succeeds.

Any ideas why this error happens?

When you read data from a folder, modify it, and save it on top of the data you initially read, Spark tries to overwrite the same keys on S3 (or the same files on HDFS). Because the read is lazy, the overwrite can remove the source files before Spark has actually finished reading them, which is what the FileNotFoundException is reporting.

I found two options:

  1. Save the data to a temp folder and then read it back (see the sketch after this list)
  2. Cache the data in memory, on disk, or both using df.persist()
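A minimal sketch of option 1, assuming `joined` stands for the DataFrame produced by the join/union above (before `.write`) and that a hypothetical staging prefix such as 's3n://mybucket/data/csv_tmp' is available to write to:

# Sketch of option 1: stage the result in a temp prefix, then overwrite the original path.
# 'joined' and the staging prefix below are assumptions for illustration.
tmp_path = 's3n://mybucket/data/csv_tmp'

# Materialize the result while the original 'data/csv' files are still intact
joined.write \
    .format('csv') \
    .option('header', 'true') \
    .mode('overwrite') \
    .save(tmp_path)

# Re-read the staged data; it no longer depends on the original source files
staged = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .load(tmp_path)

# Now it is safe to overwrite the original path
staged.coalesce(1) \
    .write \
    .format('csv') \
    .option('header', 'true') \
    .mode('overwrite') \
    .save('s3n://mybucket/data/csv')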

Resolved by adding .persist(StorageLevel.MEMORY_AND_DISK):

# StorageLevel must be imported for the persist call below
from pyspark import StorageLevel

output = csv_b \
    .join(cc, (csv_b.key == cc.key), 'inner') \
    .select('csv.key', 'csv.created_ts', 'cc.name', 'csv.text') \
    .drop_duplicates(['key']) \
    .union(csv_a) \
    .orderBy('name') \
    .coalesce(1) \
    .persist(StorageLevel.MEMORY_AND_DISK) \
    .write \
    .format('csv') \
    .option('header', 'true') \
    .mode('overwrite') \
    .save('s3n://mybucket/data/csv')
