
SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

I'm trying to use Apache Spark's sc.wholeTextFiles() on a file that is stored on Amazon S3, and I'm getting the following error:

14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
Traceback (most recent call last):
File "/root/distributed_rdd_test.py", line 27, in <module>
result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
File "/root/spark/python/pyspark/rdd.py", line 1126, in take
totalParts = self._jrdd.partitions().size()
File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
: java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

My code is the following:

from pyspark import SparkContext
import gensim.corpora.wikicorpus

sc = SparkContext(appName="Process wiki")
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
for item in result:
    print item.getvalue()
sc.stop()

So my question is: is it possible to read whole files from S3 with Spark? Based on the documentation it should be possible, but it does not seem to work for me.

When I do just:

sc = SparkContext(appName="Process wiki")
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10)
print distData

Then the error that I'm getting is exactly the same.

EDIT:

Of course I've tried sc.textFile('s3n://wiki-dump/wikiinput'), which reads the file without any problems.

EDIT2:

I've also tried to run the same code from Scala and I'm still getting the same error. Specifically, I'm trying to run val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()

Since the error message points to a specific file that you did not specify (/wikiinput/wiki.xml.gz), that file may either be corrupt, or you don't have the right permissions to access it.
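As a quick check, you could try reading that one object directly to see whether it is readable at all. This is only a minimal sketch; the full s3n:// path is my assumption, assembled from the bucket in your code and the file named in the stack trace:

from pyspark import SparkContext

sc = SparkContext(appName="Check single file")

# Assumed path: bucket from the question plus the file from the stack trace.
# If this also fails, the object itself (or its permissions) is the problem,
# not wholeTextFiles().
single = sc.textFile('s3n://wiki-dump/wikiinput/wiki.xml.gz')
print single.take(1)

sc.stop()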

Are you using the latest version of Spark? I think the Python API lagged a bit behind in older Spark versions.

And what input does gensim.corpora.wikicorpus.extract_pages expect? I'm just curious, because /wikiinput/wiki.xml.gz contains neither a protocol nor a bucket, and thus might simply fail to address the correct file. When I use wholeTextFiles from Scala on HDFS, the file name is hdfs://<host>:<port>/path/to/file .
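For illustration, here is a minimal sketch of how the (path, content) pairs returned by wholeTextFiles could be fed to extract_pages. The StringIO wrapping and the pages_from_pair helper are my assumptions (extract_pages presumably wants a file-like XML stream, not a (path, content) tuple), not something I've tested against your data:

from StringIO import StringIO

from pyspark import SparkContext
from gensim.corpora import wikicorpus

sc = SparkContext(appName="Process wiki")

# wholeTextFiles yields (path, content) pairs; the path carries the full URI,
# e.g. 's3n://wiki-dump/wikiinput/wiki.xml.gz', not just '/wikiinput/wiki.xml.gz'.
pairs = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')

def pages_from_pair(kv):
    # Hypothetical helper: pass only the content, wrapped as a file-like
    # object, to extract_pages (assumption about what it expects).
    path, content = kv
    return wikicorpus.extract_pages(StringIO(content))

result = pairs.flatMap(pages_from_pair).take(10)
print result
sc.stop()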

The problem seems to be not principally with Spark, but with the version of the Hadoop libraries it is linked against. I was getting this when using Spark 1.3.0 with Hadoop 1, but I do not see it when using Hadoop 2. If you need this method to work with S3, be sure to install a version of Spark linked against the Hadoop 2 libraries. Specifically, if you're using the spark-ec2 script to set up a cluster on AWS, be sure to include the option --hadoop-major-version=2

Full details can be found here: https://issues.apache.org/jira/browse/SPARK-4414
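If you want to verify which Hadoop libraries your running Spark assembly was built against, one quick diagnostic from PySpark is to ask the JVM through the py4j gateway. Note that _jvm is a private attribute, so treat this as a rough sketch rather than a supported API:

from pyspark import SparkContext

sc = SparkContext(appName="Check Hadoop version")

# Prints the Hadoop version the Spark assembly was built against,
# via the (private) py4j gateway into the JVM.
print sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion()

sc.stop()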
