
Apache Spark in yarn-cluster mode is throwing Hadoop FileAlreadyExistsException

I am trying to execute my Spark job in yarn-cluster mode. It works fine in standalone and yarn-client mode, but in cluster mode it throws FileAlreadyExistsException at pairs.saveAsTextFile(output);

Here is my implementation of the job:

    SparkConf sparkConf = new SparkConf().setAppName("LIM Spark PolygonFilter").setMaster(master);
    JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);

    // Broadcast the boundary geometry so every executor can use it in the filter.
    Broadcast<IGeometry> boundryBroadCaster = broadcastBoundry(javaSparkContext, boundaryPath);

    // Read the input, keep only the records that pass the polygon filter,
    // and write the result to the output directory.
    JavaRDD<String> file = javaSparkContext.textFile(input); //.cache();
    JavaRDD<String> pairs = file.filter(new FilterFunction(params, boundryBroadCaster));
    pairs.saveAsTextFile(output); // <-- throws FileAlreadyExistsException in yarn-cluster mode

As per the logs, it works for one node, and after that it starts throwing this exception for all of the remaining nodes.

Can someone please help me to fix it? Thanks.

After disabling output-spec validation it is working: spark.hadoop.validateOutputSpecs=false.

This looks like a Hadoop feature that notifies the user that the specified output directory already contains data, which would be lost if the same directory were reused for the next iteration of the job.
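If you would rather keep the validation on and clean up the old output yourself, a sketch using the standard Hadoop FileSystem API (the output variable here is assumed to be the same path used in the job above) could look like this:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Delete the previous output directory (if any) before saveAsTextFile,
    // so the next run does not fail with FileAlreadyExistsException.
    Configuration hadoopConf = javaSparkContext.hadoopConfiguration();
    FileSystem fs = FileSystem.get(URI.create(output), hadoopConf);
    Path outputPath = new Path(output);
    if (fs.exists(outputPath)) {
        fs.delete(outputPath, true); // true = delete recursively
    }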

In my application I provided an extra parameter for the job, -overwrite, and we are using it like this:

spark.hadoop.validateOutputSpecs = !overwrite

If the user wants to overwrite the existing output, they can set the "overwrite" flag to true, which turns the validation off.
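A minimal sketch of that wiring (the overwrite variable and its argument position are assumptions, not part of the original job):

    // Hypothetical: parse the -overwrite flag from the job arguments.
    boolean overwrite = Boolean.parseBoolean(args[1]);

    SparkConf sparkConf = new SparkConf()
            .setAppName("LIM Spark PolygonFilter")
            .setMaster(master)
            // validateOutputSpecs=false lets saveAsTextFile write into an
            // existing directory, so the check is disabled when overwriting.
            .set("spark.hadoop.validateOutputSpecs", String.valueOf(!overwrite));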

