Java Spark: groupByKey with key1 and then aggregateByKey with key2 on the grouped RDD
I am trying to write a simple Java Spark application that does the following.

Input data (CSV format): key1,key2,data1,data2

Basically, what I am trying to do here is: first I map each line by key1, and then I do a groupByKey operation on that RDD.
JavaRDD<String> viewRdd = sc.textFile("testfile.csv", 1);
JavaPairRDD<String, String> customerIdToRecordRDD = viewRdd
.mapToPair(w -> new Tuple2<String, String>(w.split(",")[0], w));
JavaPairRDD<String, Iterable<String>> groupedByKey1RDD = customerIdToRecordRDD.groupByKey();
System.out.println(groupedByKey1RDD.count());
Now my problem is: I need to do an aggregateByKey with key2 on each group from groupedByKey1RDD. Is there any way to convert an Iterable into an RDD, or am I missing something here? I am new to this, so any help will be appreciated.
Example input and expected output:
id_1,time0,10,10
id_2,time1,0,10
id_1,time1,11,10
id_1,time0,1,10
id_2,time1,10,10
The output is grouped by the 1st column and then aggregated by the 2nd column (the aggregation logic is simply to add column3 and column4):

id_1 : time0 : { sum1 : 11, sum2 : 20 } ,
       time1 : { sum1 : 11, sum2 : 10 }
id_2 : time1 : { sum1 : 10, sum2 : 20 }
Here is a solution using Spark 2.0 and DataFrames. Please let me know if you still want to use RDDs.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkGroupBySample {
    public static void main(String[] args) {
        // SparkSession
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkGroupBySample")
                .master("local")
                .getOrCreate();

        // Schema for the four CSV columns
        StructType schema = new StructType(new StructField[] {
                new StructField("key1", DataTypes.StringType, true, Metadata.empty()),
                new StructField("key2", DataTypes.StringType, true, Metadata.empty()),
                new StructField("data1", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("data2", DataTypes.IntegerType, true, Metadata.empty()) });

        // Read the CSV file
        Dataset<Row> dataSet = spark.read()
                .format("csv")
                .schema(schema)
                .option("header", "true")
                .option("delimiter", ",")
                .load("c:\\temp\\sample.csv");
        dataSet.show();

        // Group by both keys and sum the two data columns
        Dataset<Row> dataSet1 = dataSet
                .groupBy("key1", "key2")
                .sum("data1", "data2")
                .toDF("key1", "key2", "sum1", "sum2");
        dataSet1.show();

        // Stop the session
        spark.stop();
    }
}
Here is the output.
+----+-----+----+----+
|key1| key2|sum1|sum2|
+----+-----+----+----+
|id_1|time1| 11| 10|
|id_2|time1| 10| 20|
|id_1|time0| 11| 20|
+----+-----+----+----+
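If you do want to stay with RDDs, here is a minimal sketch of the same aggregation using a composite (key1, key2) pair as the key with aggregateByKey. It assumes the JavaSparkContext sc, the scala.Tuple2 import, and the testfile.csv path from the question; names are just placeholders.

JavaRDD<String> lines = sc.textFile("testfile.csv");

// Key each record by the composite (key1, key2) and keep (data1, data2) as the value.
JavaPairRDD<Tuple2<String, String>, Tuple2<Integer, Integer>> keyed = lines.mapToPair(line -> {
    String[] cols = line.split(",");
    return new Tuple2<>(new Tuple2<>(cols[0], cols[1]),
            new Tuple2<>(Integer.parseInt(cols[2]), Integer.parseInt(cols[3])));
});

// aggregateByKey: zero value (0, 0), a function that adds a record into the
// accumulator, and a function that merges partial accumulators across partitions.
JavaPairRDD<Tuple2<String, String>, Tuple2<Integer, Integer>> sums = keyed.aggregateByKey(
        new Tuple2<>(0, 0),
        (acc, v) -> new Tuple2<>(acc._1 + v._1, acc._2 + v._2),
        (a, b) -> new Tuple2<>(a._1 + b._1, a._2 + b._2));

sums.collect().forEach(t -> System.out.println(
        t._1._1 + " : " + t._1._2 + " -> sum1=" + t._2._1 + ", sum2=" + t._2._2));

Using the (key1, key2) pair as a single composite key avoids having to turn each group's Iterable into a nested RDD, which Spark does not support.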