
Java Spark unable to process wholeTextFiles

I am new to the Spark API and learning along the way.

I have multiple files in a Hadoop directory which I am reading with wholeTextFiles to create a JavaPairRDD<String, String>. The program is in Java.

My requirement is to process the list of files in a directory and produce the following output:

file-path, word

file-path, word

file-path, word

...

This is basically the word content of the files paired with the corresponding file name (or path) as <String, String>.

I tried the following, however the cast from Tuple2 to Iterable is not allowed (it failed at run time):

JavaSparkContext sc = new JavaSparkContext(new SparkConf());

JavaPairRDD<String, String> files = sc.wholeTextFiles(args[0]);

JavaRDD<Tuple2<String, String>> file_word = files
    .flatMap(new FlatMapFunction<Tuple2<String, String>, Tuple2<String, String>>() {
        public Iterable<Tuple2<String, String>> call(Tuple2<String, String> tuple) {
            return (Iterable<Tuple2<String, String>>) new Tuple2<String, Iterable<String>>(
                    tuple._1(),
                    Arrays.asList(((String) tuple._2()).toLowerCase().split("\\W+")));
        }
    });

I am using Java 8 and Hadoop 2 with Spark 2.2.0.

(From other questions here I understand that writing this in Scala would be easier, but I did not find a relevant answer for Java.)

Looking for a solution. Thank you.

From what I see, you are trying to cast a Tuple2 into an Iterable, which cannot work.

Since you are using Java 8, you can write this with a lambda expression, which makes things much more compact:

JavaPairRDD<String, String> rdd = sc
            .wholeTextFiles("path_to_data/*")
            .flatMapValues(x -> Arrays.asList(x.split("\\W+")));

Note that I am using flatMapValues instead of flatMap because you only need to process the second value of the tuple.
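
For reference, here is what a complete, minimal program around that snippet could look like. The class name FileWordPairs is just a placeholder, the input path is taken from args[0] as in your code, and the collect() at the end is only meant for checking the result on small test data:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class FileWordPairs {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("FileWordPairs"));

        // Each record of wholeTextFiles is (file path, whole file content).
        // flatMapValues keeps the key (the path) and expands the content into words,
        // lower-casing it as in your original attempt.
        JavaPairRDD<String, String> fileWord = sc
                .wholeTextFiles(args[0])
                .flatMapValues(content -> Arrays.asList(content.toLowerCase().split("\\W+")));

        // Print "file-path, word" pairs; collect() is only safe for small test data.
        for (Tuple2<String, String> t : fileWord.collect()) {
            System.out.println(t._1() + ", " + t._2());
        }

        sc.stop();
    }
}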

In case you are curious, with flatMap you could have done it by mapping each word of your file to a tuple (fileName, word):

JavaRDD<Tuple2<String, String>> rdd2 = sc
            .wholeTextFiles("path_to_data/*")
            .flatMap(x -> Arrays.stream(x._2.split("\\W+"))
                    .map(w -> new Tuple2<>(x._1, w))
                    .iterator());

flatMapValues simply enables you to do that with less code ;-)
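
If you then want to persist the result in exactly the file-path, word format you described, you can turn each pair into a line of text and save it. Using the rdd variable from the first snippet (the output path here is just a placeholder):

rdd.map(t -> t._1() + ", " + t._2())      // format each pair as "file-path, word"
   .saveAsTextFile("path_to_output");     // placeholder output directory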
