Error in Apache Spark while trying to read a file from hdfs (input path does not exist)

Question

I am getting the following error when I try to read a file with Spark from hdfs:

 scala> val textfile = sc.textFile("tmp/opendata/les-arbres.csv").collect()
17/10/09 19:02:31 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 341.4 KB, free 341.4 KB)
17/10/09 19:02:31 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 28.8 KB, free 370.2 KB)
17/10/09 19:02:31 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:45352 (size: 28.8 KB, free: 511.1 MB)
17/10/09 19:02:31 INFO SparkContext: Created broadcast 0 from textFile at <console>:27
17/10/09 19:02:31 INFO GPLNativeCodeLoader: Loaded native gpl library
17/10/09 19:02:31 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 7a4b57bedce694048432dd5bf5b90a6c8ccdba80]
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox.hortonworks.com:8020/user/root/tmp/opendata/les-arbres.csv
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1953)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:934)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:933)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
        at $iwC$$iwC$$iwC.<init>(<console>:40)
        at $iwC$$iwC.<init>(<console>:42)
        at $iwC.<init>(<console>:44)
        at <init>(<console>:46)
        at .<init>(<console>:50)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

On the hdfs the file is existing:

[root@sandbox ~]# hdfs dfs -ls /tmp/opendata
Found 1 items
-rw-r--r--   3 maria_dev hdfs   43404192 2017-10-09 08:30 /tmp/opendata/les-arbres.csv

I am running the Hortonworks sandbox on an Oracle VM. I am really new to Spark and I don´t know why the error is there. Do I maybe need to configure Spark first, because it seems like Spark is connected to another HDFS?

Answer 1

As I see in your hdfs dfs -ls command , your tmp folder is not inside the /user/root/ folder. You just has to apply the following:

val textfile = sc.textFile("/tmp/opendata/les-arbres.csv").collect()

You must put a "/" character before the path where are you going to look for the file.

Error in Apache Spark while trying to read a file from hdfs (input path does not exist)

Question

1 answers

solution1
0 ACCPTED 2017-10-09 21:19:32

Error in Apache Spark while trying to read a file from hdfs (input path does not exist)

Question

1 answers

solution1 0 ACCPTED 2017-10-09 21:19:32

solution1
0 ACCPTED 2017-10-09 21:19:32