How to process several JavaRDDs all at once?

I have a large dataset in CSV format, and I need to perform some RDD operations on it without using the DataFrame/Dataset API or Spark SQL. To do this, I loaded each column of the data into a separate JavaRDD.

Here is my sample dataset:

id    name    address   rank
1001  john    NY        68
1002  kevin   NZ        72
1003  steve   WA        64

Here is the code I have tried so far:

// sc is assumed to be a JavaSparkContext.
JavaRDD<String> diskfile = sc.textFile("/Users/hadoop/Downloads/a.csv");
// Each line yields exactly one value per column, so map is used instead of flatMap.
JavaRDD<String> idRDD = diskfile.map(line -> line.split(",")[0]);
JavaRDD<String> nameRDD = diskfile.map(line -> line.split(",")[1]);
JavaRDD<String> addressRDD = diskfile.map(line -> line.split(",")[2]);

After this, I applied reduceByKey to both addressRDD and nameRDD like this:

// Pair each value with 1, then sum the counts per key.
JavaPairRDD<String, Integer> addresspair = addressRDD.mapToPair(t -> new Tuple2<String, Integer>(t, 1)).reduceByKey((x, y) -> x + y);
JavaPairRDD<String, Integer> namepair = nameRDD.mapToPair(t -> new Tuple2<String, Integer>(t, 1)).reduceByKey((x, y) -> x + y);

Problem:

I sorted addresspair by value (by swapping the keys and values) and got the one address value (result) that occurs the highest number of times. Now I need to return all the required columns of the CSV file for the rows whose address field is result.

You can use filter, as shown below.

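// Keep only the lines that contain the most frequent address (result).
// Note: contains() matches anywhere in the line, not only in the address column.
JavaRDD<String> filteredData = diskfile.filter(line -> line.contains(result));
// On a cluster, foreach prints on the executors rather than on the driver.
filteredData.foreach(line -> System.out.println(line));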
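
One caveat with the substring check is that it also keeps rows where result happens to appear in another column. The sketch below shows the same count, pick-the-largest, and filter logic in plain Java with no Spark dependency, so it can be run locally. The class name AddressFilterSketch, its helper methods, and the sample rows are all made up for illustration, and it assumes a comma-separated "id,name,address,rank" layout with no header row.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AddressFilterSketch {

    // Assumed position of the address field in "id,name,address,rank".
    public static final int ADDRESS_COLUMN = 2;

    // Returns the trimmed value of one column of a comma-separated line.
    public static String column(String line, int index) {
        return line.split(",")[index].trim();
    }

    // True only when the address column equals the wanted value exactly.
    public static boolean hasAddress(String line, String address) {
        return column(line, ADDRESS_COLUMN).equals(address);
    }

    // Counts each address and returns one with the largest count. This is a
    // local stand-in for mapToPair + reduceByKey followed by picking the maximum.
    public static String mostFrequentAddress(List<String> lines) {
        Map<String, Long> counts = lines.stream()
                .collect(Collectors.groupingBy(
                        line -> column(line, ADDRESS_COLUMN),
                        Collectors.counting()));
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("no rows"));
    }

    public static void main(String[] args) {
        // Hypothetical rows; the name "NYle" contains "NY" but its address is WA.
        List<String> lines = Arrays.asList(
                "1001,john,NY,68",
                "1002,kevin,NZ,72",
                "1003,steve,NY,64",
                "1004,NYle,WA,70");

        String result = mostFrequentAddress(lines);

        // Prints the two NY rows; a contains() check would also print row 1004.
        lines.stream()
                .filter(line -> hasAddress(line, result))
                .forEach(System.out::println);
    }
}
```

In the Spark job, the same predicate could presumably replace the substring check, for example diskfile.filter(line -> AddressFilterSketch.hasAddress(line, result)), provided result is effectively final. If the file has a header row, it would need to be skipped before counting.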


 