How to process several JavaRDDs all at once?
I have a large dataset in csv format, and I need to perform some RDD operations on it without using the DataFrames/Dataset API or SparkSQL. To achieve this, I loaded each column of data into a separate JavaRDD.

Here is my sample dataset:
id name address rank
1001 john NY 68
1002 kevin NZ 72
1003 steve WA 64
Here is the code I have tried so far:
JavaRDD<String> diskfile = sc.textFile("/Users/hadoop/Downloads/a.csv");
// map (not flatMap) suffices here: each line yields exactly one value per column
JavaRDD<String> idRDD      = diskfile.map(line -> line.split(",")[0]);
JavaRDD<String> nameRDD    = diskfile.map(line -> line.split(",")[1]);
JavaRDD<String> addressRDD = diskfile.map(line -> line.split(",")[2]);
After this, I applied reduceByKey on both addressRDD and nameRDD like this:
JavaPairRDD<String, Integer> addresspair = addressRDD
        .mapToPair(t -> new Tuple2<String, Integer>(t, 1))
        .reduceByKey((x, y) -> x + y);
JavaPairRDD<String, Integer> namepair = nameRDD
        .mapToPair(t -> new Tuple2<String, Integer>(t, 1))
        .reduceByKey((x, y) -> x + y);
Problem:

I applied sortByKey (after swapping the key-value pairs) on addresspair and got the single address value (result) that occurred the highest number of times. Now I need to return all columns of the csv file whose address field equals result.
You can use filter like below:
JavaRDD<String> filteredData = diskfile.filter(add -> add.contains(result));
filteredData.foreach(data -> {
    System.out.println(data);
});
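One caveat: `contains(result)` matches the value anywhere in the line, so a row whose name (or any other field) happens to contain the same text would also pass. Comparing the address column directly is safer. A minimal sketch, assuming comma-separated lines with the address in the third column; the class name `FilterByAddress` and the sample rows are illustrative assumptions:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FilterByAddress {
    // Keep only the rows whose address column (index 2) equals the given
    // value, instead of matching it anywhere in the line.
    static List<String> rowsWithAddress(JavaRDD<String> lines, String result) {
        return lines.filter(line -> line.split(",")[2].equals(result)).collect();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("filter-by-address").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> diskfile = sc.parallelize(Arrays.asList(
                    "1001,john,NY,68", "1002,kevin,NZ,72", "1003,steve,NY,64"));
            rowsWithAddress(diskfile, "NY").forEach(System.out::println);
        }
    }
}
```

Each returned row is still the full original csv line, so all columns for the matching address come back together.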