
Java spark groupByKey with key1 and do aggregateByKey with key2 on groupedRDD

I am trying to write a simple Java Spark application which does the following:

Input data CSV format: key1,key2,data1,data2

Basically, what I am trying to do here is:

First I am mapping each line by key1 and then doing a groupByKey operation on that RDD.

// Key each record by its first field (key1)
JavaRDD<String> viewRdd = sc.textFile("testfile.csv", 1);
JavaPairRDD<String, String> customerIdToRecordRDD = viewRdd
    .mapToPair(w -> new Tuple2<String, String>(w.split(",")[0], w));
// Group the full records by key1
JavaPairRDD<String, Iterable<String>> groupedByKey1RDD = customerIdToRecordRDD.groupByKey();
System.out.println(groupedByKey1RDD.count());

Now my problem is, I need to do an aggregateByKey with key2 on each group from groupedByKey1RDD. Is there any way to convert an Iterable into an RDD, or am I missing something here? I am new to this; any help will be appreciated.

Example input and expected output:

id_1,time0,10,10
id_2,time1,0,10
id_1,time1,11,10
id_1,time0,1,10
id_2,time1,10,10

Output is grouped by the 1st column and then aggregated by the 2nd column (the aggregation logic is to simply add column3 and column4):

id_1 : time0 : { sum1 : 11, sum2 : 20} ,
       time1 : { sum1 : 11, sum2 : 10}

id_2 : time1 : { sum1 : 10, sum2 : 20} 

Here is the solution using Spark 2.0 and DataFrames. Please let me know if you still want to use RDDs.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkGroupBySample {
    public static void main(String[] args) {
        // SparkSession
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkGroupBySample")
                .master("local")
                .getOrCreate();
        // Schema for the four csv columns
        StructType schema = new StructType(new StructField[] {
                new StructField("key1", DataTypes.StringType, true, Metadata.empty()),
                new StructField("key2", DataTypes.StringType, true, Metadata.empty()),
                new StructField("data1", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("data2", DataTypes.IntegerType, true, Metadata.empty()) });
        // Read the csv file
        Dataset<Row> dataSet = spark.read()
                .format("csv")
                .schema(schema)
                .option("header", "true")
                .option("delimiter", ",")
                .load("c:\\temp\\sample.csv");
        dataSet.show();
        // Group by both keys and sum the two data columns
        Dataset<Row> dataSet1 = dataSet
                .groupBy("key1", "key2")
                .sum("data1", "data2")
                .toDF("key1", "key2", "sum1", "sum2");
        dataSet1.show();
        // Stop the session
        spark.stop();
    }
}

Here is the output.

+----+-----+----+----+
|key1| key2|sum1|sum2|
+----+-----+----+----+
|id_1|time1|  11|  10|
|id_2|time1|  10|  20|
|id_1|time0|  11|  20|
+----+-----+----+----+
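
For reference, here is a minimal RDD-based sketch of the same aggregation, in case you still prefer the RDD API. It assumes the testfile.csv layout from the question (no header row) and an existing JavaSparkContext named sc; the other variable names are just illustrative. Instead of grouping by key1 first, it keys every record by the (key1, key2) pair and sums data1 and data2 with reduceByKey, so there is no need to turn an Iterable back into an RDD.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Assumes an existing JavaSparkContext "sc" and csv rows of the form key1,key2,data1,data2
JavaRDD<String> lines = sc.textFile("testfile.csv", 1);

// Key each record by the (key1, key2) pair, carrying (data1, data2) as the value
JavaPairRDD<Tuple2<String, String>, Tuple2<Integer, Integer>> keyed = lines.mapToPair(line -> {
    String[] f = line.split(",");
    return new Tuple2<>(new Tuple2<>(f[0], f[1]),
            new Tuple2<>(Integer.parseInt(f[2]), Integer.parseInt(f[3])));
});

// Sum data1 and data2 within each (key1, key2) group
JavaPairRDD<Tuple2<String, String>, Tuple2<Integer, Integer>> sums =
        keyed.reduceByKey((a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()));

// Print the result in roughly the format shown in the question
for (Tuple2<Tuple2<String, String>, Tuple2<Integer, Integer>> t : sums.collect()) {
    System.out.println(t._1()._1() + " : " + t._1()._2()
            + " : { sum1 : " + t._2()._1() + ", sum2 : " + t._2()._2() + " }");
}

Using the composite (key1, key2) key lets reduceByKey combine partial sums on each partition before the shuffle, which is generally cheaper than groupByKey followed by a per-group aggregation.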
