How to process several JavaRDDs all at once?
I have a large dataset in csv format, and I need to perform some RDD operations on it without using the DataFrames/Dataset API or SparkSQL. To achieve this, I loaded each column of data into a separate JavaRDD.

Here is my sample dataset:
id name address rank
1001 john NY 68
1002 kevin NZ 72
1003 steve WA 64
Here is the code I have tried so far:
JavaRDD<String> diskfile = sc.textFile("/Users/hadoop/Downloads/a.csv");
// map (not flatMap) suffices here: each line yields exactly one value per column
JavaRDD<String> idRDD      = diskfile.map(line -> line.split(",")[0]);
JavaRDD<String> nameRDD    = diskfile.map(line -> line.split(",")[1]);
JavaRDD<String> addressRDD = diskfile.map(line -> line.split(",")[2]);
After this, I applied reduceByKey on both addressRDD and nameRDD like this:
JavaPairRDD<String, Integer> addresspair = addressRDD
        .mapToPair(t -> new Tuple2<String, Integer>(t, 1))
        .reduceByKey((x, y) -> x + y);
JavaPairRDD<String, Integer> namepair = nameRDD
        .mapToPair(t -> new Tuple2<String, Integer>(t, 1))
        .reduceByKey((x, y) -> x + y);
Problem:

I applied sortByKey (after swapping the key-value pairs) on addresspair and got the single address value (result) that occurred the highest number of times. Now I need to return all columns of the csv file whose address field equals result.
You can use filter like below:
JavaRDD<String> filteredData = diskfile.filter(add -> add.contains(result));
filteredData.foreach(data -> {
    System.out.println(data);
});
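One caveat: `contains(result)` matches the value anywhere in the line, so a row whose name (or any other field) happens to contain the same text would also pass. Comparing the address column directly is safer. A minimal sketch, assuming comma-separated lines with the address in the third column; the class name `FilterByAddress` and the sample rows are illustrative assumptions:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FilterByAddress {
    // Keep only the rows whose address column (index 2) equals the given
    // value, instead of matching it anywhere in the line.
    static List<String> rowsWithAddress(JavaRDD<String> lines, String result) {
        return lines.filter(line -> line.split(",")[2].equals(result)).collect();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("filter-by-address").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> diskfile = sc.parallelize(Arrays.asList(
                    "1001,john,NY,68", "1002,kevin,NZ,72", "1003,steve,NY,64"));
            rowsWithAddress(diskfile, "NY").forEach(System.out::println);
        }
    }
}
```

Each returned row is still the full original csv line, so all columns for the matching address come back together.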