
File not found exception when submitting Spark job to EMR

We have a Spark job that runs fine in local standalone mode. We submitted it to AWS EMR 5.0 (Spark 2.0, Hadoop 2.7.2) and are receiving the following error:

java.io.FileNotFoundException: File does not exist: hdfs://ip.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1470941709244_0001/__spark_libs__3533384422462530422.zip
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Failing this attempt. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1470941880009
     final status: FAILED
     tracking URL: http://ip.us-west-2.compute.internal:8088/cluster/app/application_1470941709244_0001
     user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1470941709244_0001 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

We are submitting the job in 'cluster' mode with the spark-submit option --class com.company.project.Preprocess, and the jar is stored in S3. The full command is along these lines (the bucket path below is a placeholder for our actual S3 location):
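
    # Placeholder jar path; the real command points at our actual S3 location.
    spark-submit \
        --deploy-mode cluster \
        --master yarn \
        --class com.company.project.Preprocess \
        s3://our-bucket/preprocess.jar

Does anyone know what might be causing this error?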

This looks like a JDK version mismatch. Check whether you are running with the EMR-supported Java 7, or set the EMR configuration below to use Java 8:

[
    {
        "Classification": "hadoop-env",
        "Configurations": [
            {
                "Classification": "export",
                "Configurations": [],
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ],
        "Properties": {}
    },
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Configurations": [],
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                }
            }
        ],
        "Properties": {}
    }
]
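
You can apply this configuration when creating the cluster, for example with the AWS CLI. The sketch below is illustrative: the instance settings and the file name myJavaConfig.json are placeholders, and --configurations is the relevant part.

    # Illustrative create-cluster call; adjust instance settings to your needs.
    aws emr create-cluster --release-label emr-5.0.0 \
        --applications Name=Spark \
        --use-default-roles \
        --instance-type m3.xlarge --instance-count 3 \
        --configurations file://./myJavaConfig.json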

Check whether you have defined the SparkContext properly. When deploying in cluster mode, do not set the master in your code; spark-submit supplies it.

You can define the SparkContext as follows:

    val sc = new SparkContext(new SparkConf().setAppName("ApplicationName"))
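
Conversely, if your code hard-codes the master, it will conflict with cluster mode on YARN. A minimal sketch of the pattern to avoid ("ApplicationName" is a placeholder):

    // Hard-coding the master works for local runs but conflicts with
    // --deploy-mode cluster on YARN; remove setMaster and let
    // spark-submit provide the master instead.
    val sc = new SparkContext(
      new SparkConf()
        .setMaster("local[*]")  // remove this before submitting to EMR
        .setAppName("ApplicationName"))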
