GroupByKey with datasets in Spark 2.0 using Java

I have a dataset containing data like the following:

| c1 | c2 |
-----------
| 1  | a  |
| 1  | b  |
| 1  | c  |
| 2  | a  |
| 2  | b  |

...

Now, I want to get the data grouped like the following (col1: String Key, col2: List) :

| c1 | c2    |
--------------
| 1  | a,b,c |
| 2  | a,b   |
...

I thought that groupByKey would be a sufficient solution, but I can't find any example of how to use it.

Can anyone help me find a solution using groupByKey, or any other combination of transformations and actions, that produces this output with Datasets rather than RDDs?

Here is a Spark 2.0 example in Java using a Dataset.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

import scala.Tuple2;

public class SparkSample {
    public static void main(String[] args) {
        // SparkSession; the warehouse dir is a Windows-style path, adjust it for your environment
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSample")
                .config("spark.sql.warehouse.dir", "/file:C:/temp")
                .master("local")
                .getOrCreate();

        // input data: (c1, c2) pairs
        List<Tuple2<Integer, String>> inputList = new ArrayList<>();
        inputList.add(new Tuple2<>(1, "a"));
        inputList.add(new Tuple2<>(1, "b"));
        inputList.add(new Tuple2<>(1, "c"));
        inputList.add(new Tuple2<>(2, "a"));
        inputList.add(new Tuple2<>(2, "b"));

        // build a Dataset of tuples and name its columns c1 and c2
        Dataset<Row> dataSet = spark
                .createDataset(inputList, Encoders.tuple(Encoders.INT(), Encoders.STRING()))
                .toDF("c1", "c2");
        dataSet.show();

        // group by c1 and collect the c2 values of each group into a list column
        Dataset<Row> dataSet1 = dataSet
                .groupBy("c1")
                .agg(functions.collect_list("c2"))
                .toDF("c1", "c2");
        dataSet1.show();

        // stop the session
        spark.stop();
    }
}

With a DataFrame in Spark 2.0:

scala> val data = List((1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b")).toDF("c1", "c2")
data: org.apache.spark.sql.DataFrame = [c1: int, c2: string]
scala> data.groupBy("c1").agg(collect_list("c2")).collect.foreach(println)
[1,WrappedArray(a, b, c)]
[2,WrappedArray(a, b)]

Assuming dataset is an existing Dataset<Row> with columns c1 and c2, this groups it by c1 and collects the c2 values of each group into a list:

Dataset<Row> datasetNew = dataset.groupBy("c1").agg(functions.collect_list("c2"));
datasetNew.show();
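
To make the expected result concrete, here is a minimal plain-Java sketch of what groupBy("c1") followed by collect_list("c2") computes for each key. It does not use Spark at all: the class GroupSketch and its methods row and groupValues are hypothetical names made up for this illustration, and the grouping is done in memory with Collectors.groupingBy. One difference to keep in mind is that this sketch keeps the values in input order, whereas Spark documents collect_list as non-deterministic in ordering after a shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration class, not part of Spark.
public class GroupSketch {

    // Builds one (c1, c2) row.
    public static Map.Entry<Integer, String> row(int c1, String c2) {
        return new SimpleEntry<>(c1, c2);
    }

    // Groups the rows by c1 and collects the c2 values of each key into a list.
    // LinkedHashMap keeps the keys in first-seen order.
    public static Map<Integer, List<String>> groupValues(List<Map.Entry<Integer, String>> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                LinkedHashMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> rows = Arrays.asList(
                row(1, "a"), row(1, "b"), row(1, "c"), row(2, "a"), row(2, "b"));

        // Print each key with its values joined by commas, like the desired output table.
        groupValues(rows).forEach((key, values) ->
                System.out.println(key + " -> " + String.join(",", values)));
    }
}
```

If the final column should be a single string such as a,b,c rather than a list, joining the collected values (as the main method does here with String.join) is one way to get there.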
