如何使用镶木地板在火花中读取和写入同一个文件？

Question

I am trying to read from a parquet file in spark, do a union with another rdd and then write the result into the same file I have read from (basically overwrite), this throws the following error:我正在尝试从 spark 中的 parquet 文件中读取数据，与另一个 rdd 进行联合，然后将结果写入我读取过的同一文件中（基本上是覆盖），这会引发以下错误：

 couldnt write parquet to file: An error occurred while calling o102.parquet.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenExchange hashpartitioning(billID#42,200), None
+- Union
   :- Scan ParquetRelation[units#35,price#36,priceSold#37,orderingTime#38,itemID#39,storeID#40,customerID#41,billID#42,sourceRef#43] InputPaths: hdfs://master-wat:8020/user/root/dataFile/parquet/general/NPM61LKK1C/Billbody
   +- Project [units#22,price#23,priceSold#24,orderingTime#25,itemID#26,storeID#27,customerID#28,billID#29,2 AS sourceRef#30]
      +- Scan ExistingRDD[units#22,price#23,priceSold#24,orderingTime#25,itemID#26,storeID#27,customerID#28,billID#29] 

    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
    at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:64)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.Window.doExecute(Window.scala:245)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.Filter.doExecute(basicOperators.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:109)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /user/root/dataFile/parquet/general/NPM61LKK1C/Billbody/part-r-00000-c51e45d3-6824-4fc2-9510-802e5379a86f.gz.parquet
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1934)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1875)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1855)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1827)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:566)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:88)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:361)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1222)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1210)
    at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1260)
    at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:220)
    at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:216)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:216)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:208)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:395)
    at org.apache.parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:294)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$buildInternalScan$1$$anon$1.getPartitions(ParquetRelation.scala:363)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
    at org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:220)
    at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
    at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
    ... 56 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/root/dataFile/parquet/general/NPM61LKK1C/Billbody/part-r-00000-c51e45d3-6824-4fc2-9510-802e5379a86f.gz.parquet
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1934)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1875)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1855)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1827)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:566)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:88)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:361)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

    at org.apache.hadoop.ipc.Client.call(Client.java:1468)
    at org.apache.hadoop.ipc.Client.call(Client.java:1399)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy20.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:254)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy21.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1220)
    ... 92 more

which I am assuming means that when writing to the file, the original file is needed for the union and spark can't find the file any more.我假设这意味着在写入文件时，联合需要原始文件，并且 spark 无法再找到该文件。 I have tried caching what I have read from the parquet to avoid spark needing the file but that didn't work either.我已经尝试缓存我从镶木地板上读取的内容，以避免火花需要该文件，但这也不起作用。 Any help on Hadoop's best practice for doing this is greatly appreciated.非常感谢有关 Hadoop 最佳实践的任何帮助。

Answer 1

As spark does lazy transformation, it basically first wiped your destination directory and then goes and tries to read from source location.由于 spark 进行惰性转换，它基本上首先擦除您的目标目录，然后尝试从源位置读取。 Hence you are getting this error.因此，您收到此错误。

One possible way is to overcome this is to use collect on your data frame .解决此问题的一种可能方法是在您的数据框上使用 collect 。 To avoid getting OOM exception filter data and use collect()[1] .为避免获取 OOM 异常过滤器数据并使用 collect()[1] 。 This will force DAG to first read data and specify output to driver.这将强制 DAG 首先读取数据并指定输出到驱动程序。 And hence your data will be read before it's overwritten.因此，您的数据将在被覆盖之前被读取。

Answer 2

This causes a problem as you are reading and writing to the same location that you are trying to overwrite, it is Spark issue.这会导致问题，因为您正在读取和写入您尝试覆盖的同一位置，这是 Spark 问题。

The workaround is to store write your data in a temp folder, not inside the location you are working on, and read from it as the source to your initial location.解决方法是将写入的数据存储在临时文件夹中，而不是在您正在处理的位置内，并从它作为源读取到您的初始位置。

read from root/myfolder从根目录/我的文件夹中读取
make your data transformations进行数据转换
write transformed data into root/mytempfolder将转换后的数据写入 root/mytemp 文件夹
read from root/mytempfolder从 root/mytemp 文件夹读取
write into root/myfolder写入根目录/我的文件夹

Answer 3

您必须在模式下使用覆盖选项，请尝试使用追加

df.repartition(200).write.mode("append").parquet("path/parquet_name")

Answer 4

Just ran into the same issue...刚遇到同样的问题...

you need to cache the first rdd before union.您需要在联合之前cache第一个 rdd。 This would ensure it was read from disk into memory before you write to it.这将确保在您写入之前将其从磁盘读取到内存中。

val cached = first.cache()
cached.union(second).write.mode("overwrite").parquet("...")

Answer 5

尝试这个：

Df.write.format("parquet").mode("overwrite").insertInto(file_path)

Answer 6

You can use insertinto instead of save.您可以使用 insertinto 而不是 save。 It will work.它会起作用的。 Df.write.mode("parquet").mode("overwrite").insertInto(file_path) df.write.mode("parquet").mode("overwrite").insertInto(file_path)

如何使用镶木地板在火花中读取和写入同一个文件？

问题描述

6 个解决方案

解决方案1
3 2020-04-06 03:51:53

解决方案2
2 2022-01-11 09:53:40

解决方案3
0 2019-01-02 19:21:50

解决方案4
0 2019-01-14 03:00:54

解决方案5
0 2022-06-22 08:03:30

解决方案6
-1 2020-09-03 22:44:42

如何使用镶木地板在火花中读取和写入同一个文件？

问题描述

6 个解决方案

解决方案1 3 2020-04-06 03:51:53

解决方案2 2 2022-01-11 09:53:40

解决方案3 0 2019-01-02 19:21:50

解决方案4 0 2019-01-14 03:00:54

解决方案5 0 2022-06-22 08:03:30

解决方案6 -1 2020-09-03 22:44:42

解决方案1
3 2020-04-06 03:51:53

解决方案2
2 2022-01-11 09:53:40

解决方案3
0 2019-01-02 19:21:50

解决方案4
0 2019-01-14 03:00:54

解决方案5
0 2022-06-22 08:03:30

解决方案6
-1 2020-09-03 22:44:42