
Java Spark unable to process wholeTextFiles

I am new to the Spark API and learning along the way.

I have multiple files in a Hadoop directory which I am reading using wholeTextFiles to create a JavaPairRDD<String, String>. The program is in Java.

My requirement is to process the list of files in a directory and produce the following output:

file-path, word

file-path, word

file-path, word

...

This is basically the word content of the files, with each word paired with its corresponding file name (or path) as <String, String>.

I tried the following, however casting from Tuple2 to Iterable is not allowed (it failed at run time):

JavaSparkContext sc = new JavaSparkContext(new SparkConf());

JavaPairRDD<String, String> files = sc.wholeTextFiles(args[0]);

JavaRDD<Tuple2<String, String>> file_word = files
    .flatMap(new FlatMapFunction<Tuple2<String, String>, Tuple2<String, String>>()
    {
        public Iterable<Tuple2<String, String>> call(Tuple2<String, String> tuple)
        {
            // This cast is invalid: a Tuple2 is not an Iterable, so it fails at run time
            return (Iterable<Tuple2<String, String>>) new Tuple2<String, Iterable<String>>(
                    tuple._1(), Arrays.asList(tuple._2().toLowerCase().split("\\W+")));
        }
    });

I am using Java 8 and Hadoop 2 with Spark 2.2.0.

(By looking at other questions here I can see that writing this in Scala is easier, however I did not find a relevant answer for Java.)

Looking for a solution. Thank you.

From what I see, you are trying to cast a Tuple2 into an Iterable, which cannot work.

Since you are using Java 8, you can write this with a lambda expression, which makes things much more compact:

JavaPairRDD<String, String> rdd = sc
            .wholeTextFiles("path_to_data/*")
            .flatMapValues(x -> Arrays.asList(x.split("\\W+")));

Note that I am using flatMapValues instead of flatMap because you only need to process the second value of the tuple.
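For completeness, here is a minimal end-to-end sketch of the flatMapValues approach; the class name, app name, local master, and the way the result is printed are illustrative assumptions, not part of the original answer:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileWords {
    public static void main(String[] args) {
        // Local master and app name are placeholders; adjust for your cluster
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("FileWords").setMaster("local[*]"));

        // wholeTextFiles yields (file-path, file-content) pairs;
        // flatMapValues splits each content into lower-cased words
        // (lower-casing as in the question's attempt) and keeps the path as the key
        JavaPairRDD<String, String> rdd = sc
                .wholeTextFiles(args[0])
                .flatMapValues(x -> Arrays.asList(x.toLowerCase().split("\\W+")));

        // Print each (file-path, word) pair; collect() is only safe for small data
        rdd.collect().forEach(t -> System.out.println(t._1() + ", " + t._2()));

        sc.stop();
    }
}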

In case you are curious, with flatMap you could have done it by mapping each word of your file to a tuple (fileName, word):

JavaRDD<Tuple2<String, String>> rdd2 = sc
            .wholeTextFiles("path_to_data/*")
            .flatMap(x -> Arrays.asList(x._2().split("\\W+"))
                    .stream()
                    .map(w -> new Tuple2<>(x._1(), w))
                    .iterator());

flatMapValues simply enables you to do that with less code ;-)
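One detail worth noting: the flatMap version produces a JavaRDD<Tuple2<String, String>> rather than a JavaPairRDD<String, String>. If you need key-based operations such as reduceByKey afterwards, a small conversion sketch like the following should work (the identity lambda here is just for illustration):

// mapToPair turns the Tuple2 elements back into a pair RDD,
// making key-based operations (reduceByKey, groupByKey, ...) available
JavaPairRDD<String, String> pairs = rdd2.mapToPair(t -> t);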
