Apache Spark in yarn-cluster mode is throwing Hadoop FileAlreadyExistsException
I am trying to execute my Spark job in yarn-cluster mode. It works fine in standalone and yarn-client mode, but in cluster mode it throws a FileAlreadyExistsException at
pairs.saveAsTextFile(output);
Here is my job implementation:
SparkConf sparkConf = new SparkConf().setAppName("LIM Spark PolygonFilter").setMaster(master);
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
Broadcast<IGeometry> boundryBroadCaster = broadcastBoundry(javaSparkContext, boundaryPath);
JavaRDD<String> file = javaSparkContext.textFile(input);//.cache();
JavaRDD<String> pairs = file.filter(new FilterFunction(params , boundryBroadCaster));
pairs.saveAsTextFile(output);
According to the logs, it works on one node, and after that it starts throwing this exception on all of the remaining nodes.
Can someone please help me fix it? Thanks.
It started working after I disabled output-spec validation (spark.hadoop.validateOutputSpecs=false).
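For reference, the same property can be passed at submit time instead of being hard-coded; a sketch of the invocation (the jar name and application arguments here are placeholders, not from the original job):

```shell
# Disable the existing-output-directory check for this run only.
# "yarn-cluster" matches the deployment mode used in the question.
spark-submit \
  --master yarn-cluster \
  --conf spark.hadoop.validateOutputSpecs=false \
  my-spark-job.jar
```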
This looks like a Hadoop feature that notifies the user that the specified output directory already contains data, which would be lost if the same directory were reused for the next iteration of this job.
In my application I added an extra job parameter, -overwrite, and we use it like this:
spark.hadoop.validateOutputSpecs = !(value of overwrite flag)
If the user wants to overwrite the existing output, they can set the "overwrite" flag to true.
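A minimal configuration sketch of that wiring, assuming the overwrite value has already been parsed from the -overwrite command-line flag (the class and method names here are illustrative, not from the original job):

```java
import org.apache.spark.SparkConf;

public class JobConfig {
    // Builds the SparkConf for the job. When 'overwrite' is true,
    // output-spec validation is disabled so Spark writes into an
    // existing output directory instead of failing with
    // FileAlreadyExistsException.
    public static SparkConf build(String master, boolean overwrite) {
        return new SparkConf()
                .setAppName("LIM Spark PolygonFilter")
                .setMaster(master)
                .set("spark.hadoop.validateOutputSpecs",
                     String.valueOf(!overwrite));
    }
}
```

Note that the property takes the negation of the flag: overwrite=true means validation must be off (false). Also be aware that disabling validation only skips the check; it does not delete the old directory's contents first.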