I have a dataset containing data like the following:
| c1| c2|
---------
| 1 | a |
| 1 | b |
| 1 | c |
| 2 | a |
| 2 | b |
...
Now I want to group the data as follows (c1: String key, c2: List):
| c1| c2  |
-----------
| 1 |a,b,c|
| 2 |a,b  |
...
I thought that groupByKey would be a sufficient solution, but I can't find any example of how to use it.
Can anyone help me to find a solution using groupByKey or using any other combination of transformations and actions to get this output by using datasets, not RDD?
Here is a Spark 2.0 example in Java using a Dataset.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class SparkSample {
    public static void main(String[] args) {
        // SparkSession (the warehouse dir is a local Windows path; adjust it for your machine)
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSample")
                .config("spark.sql.warehouse.dir", "/file:C:/temp")
                .master("local")
                .getOrCreate();
        // input data: (c1, c2) pairs
        List<Tuple2<Integer,String>> inputList = new ArrayList<Tuple2<Integer,String>>();
        inputList.add(new Tuple2<Integer,String>(1, "a"));
        inputList.add(new Tuple2<Integer,String>(1, "b"));
        inputList.add(new Tuple2<Integer,String>(1, "c"));
        inputList.add(new Tuple2<Integer,String>(2, "a"));
        inputList.add(new Tuple2<Integer,String>(2, "b"));
        // dataset with named columns c1 and c2
        Dataset<Row> dataSet = spark.createDataset(inputList, Encoders.tuple(Encoders.INT(), Encoders.STRING())).toDF("c1","c2");
        dataSet.show();
        // group by c1 and collect the c2 values of each group into a list column
        Dataset<Row> dataSet1 = dataSet.groupBy("c1").agg(org.apache.spark.sql.functions.collect_list("c2")).toDF("c1","c2");
        dataSet1.show();
        // stop
        spark.stop();
    }
}
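Since the question asks specifically about groupByKey, the typed route can be sketched as follows. This is only a minimal sketch, not part of the answer above: the class name GroupByKeySketch and the helper groupValues are hypothetical, and joining the values into a comma-separated String (rather than returning a List column) is an assumption made to keep the result encodable with the built-in tuple encoders. The values are sorted inside each group because the order in which a group's rows arrive is not guaranteed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.KeyValueGroupedDataset;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

// Hypothetical class name, used only for this illustration.
public class GroupByKeySketch {

    // Groups (c1, c2) pairs by c1 and joins each group's c2 values with commas.
    static List<Tuple2<Integer, String>> groupValues(SparkSession spark,
                                                     List<Tuple2<Integer, String>> rows) {
        Encoder<Tuple2<Integer, String>> pairEncoder =
                Encoders.tuple(Encoders.INT(), Encoders.STRING());

        // Typed Dataset of pairs (no toDF, so the tuple type is kept).
        Dataset<Tuple2<Integer, String>> ds = spark.createDataset(rows, pairEncoder);

        // groupByKey takes a key-extracting function plus an encoder for the key type.
        KeyValueGroupedDataset<Integer, Tuple2<Integer, String>> grouped = ds.groupByKey(
                (MapFunction<Tuple2<Integer, String>, Integer>) t -> t._1(),
                Encoders.INT());

        // mapGroups receives each key together with an iterator over that key's rows.
        Dataset<Tuple2<Integer, String>> result = grouped.mapGroups(
                (MapGroupsFunction<Integer, Tuple2<Integer, String>, Tuple2<Integer, String>>)
                        (key, values) -> {
                            List<String> items = new ArrayList<>();
                            while (values.hasNext()) {
                                items.add(values.next()._2());
                            }
                            // Sort so the joined string does not depend on arrival order.
                            Collections.sort(items);
                            return new Tuple2<>(key, String.join(",", items));
                        },
                pairEncoder);

        return result.collectAsList();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("GroupByKeySketch")
                .master("local")
                .getOrCreate();

        List<Tuple2<Integer, String>> input = Arrays.asList(
                new Tuple2<>(1, "a"), new Tuple2<>(1, "b"), new Tuple2<>(1, "c"),
                new Tuple2<>(2, "a"), new Tuple2<>(2, "b"));

        // Print one (key, joined values) pair per group; row order may vary.
        for (Tuple2<Integer, String> row : groupValues(spark, input)) {
            System.out.println(row._1() + " -> " + row._2());
        }

        spark.stop();
    }
}
```

Note that mapGroups does not do partial aggregation, so all rows of a group are shuffled to one place before the function runs. For a plain "collect the values per key" result, the groupBy plus collect_list approach shown above is usually the simpler choice.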
With a DataFrame in Spark 2.0:
scala> val data = List((1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b")).toDF("c1", "c2")
data: org.apache.spark.sql.DataFrame = [c1: int, c2: string]
scala> data.groupBy("c1").agg(collect_list("c2")).collect.foreach(println)
[1,WrappedArray(a, b, c)]
[2,WrappedArray(a, b)]
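Assuming the table has already been read into the dataset variable, this groups it by c1 and collects the c2 values of each group:
// requires: import org.apache.spark.sql.functions;
Dataset<Row> datasetNew = dataset.groupBy("c1").agg(functions.collect_list("c2"));
datasetNew.show();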