
File not found exception when submitting spark job to EMR

We have a Spark job that runs fine in local standalone mode. We submitted it to AWS EMR-5.0 (Spark 2.0, Hadoop 2.7.2) and are receiving the following error:

java.io.FileNotFoundException: File does not exist: hdfs://ip.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1470941709244_0001/__spark_libs__3533384422462530422.zip
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Failing this attempt. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1470941880009
     final status: FAILED
     tracking URL: http://ip.us-west-2.compute.internal:8088/cluster/app/application_1470941709244_0001
     user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1470941709244_0001 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

We are submitting the job in 'cluster' mode with the following spark-submit option: --class com.company.project.Preprocess, and the jar is stored in S3. Does anyone know what might be causing this error?
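
For reference, a minimal sketch of what such a submission might look like from the master node (the bucket and jar names are hypothetical placeholders, not from the original question):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.company.project.Preprocess \
      s3://my-bucket/jars/preprocess.jar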

Looks like a JDK version mismatch. Please check that you are running with the EMR-supported Java 7, or set the EMR configuration below to use Java 8:

[
    {
        "Classification": "hadoop-env",
        "Configurations": [
            {
                "Classification": "export",
                "Configurations": [],
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ],
        "Properties": {}
    },
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Configurations": [],
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ],
        "Properties": {}
    }
]
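
These classifications take effect at cluster creation time; for example, they can be supplied through the AWS CLI (the configuration file name and instance settings here are hypothetical placeholders):

    aws emr create-cluster --release-label emr-5.0.0 --applications Name=Spark \
        --use-default-roles --instance-type m3.xlarge --instance-count 3 \
        --configurations file://./java8-config.json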

Check whether you have defined the SparkContext properly. When deploying in cluster mode, do not set the master option in code.

You can define the SparkContext as follows:

    val sc = new SparkContext(new SparkConf().setAppName("ApplicationName"))
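
Since EMR-5.0 ships Spark 2.0, the same context can also be obtained through the newer SparkSession entry point (a minimal sketch; the application name is a placeholder):

    import org.apache.spark.sql.SparkSession

    // Do not call .master(...) here either; in cluster mode, spark-submit supplies it.
    val spark = SparkSession.builder()
      .appName("ApplicationName")
      .getOrCreate()

    // The underlying SparkContext is still available when needed.
    val sc = spark.sparkContext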
