
How to use Spark Dataset API to aggregate Key-List?

With Spark 2.x, starting from a Dataset such as this:

|keyword    |url
|A1         |url1
|A1         |url2
|A1         |url3
|A1         |url4
|A2         |url1
|A2         |url2
|A2         |url3

How could I obtain:

|keyword    |url
|A1         |url1,url2,url3,url4
|A2         |url1,url2,url3

Try this:

import org.apache.spark.sql.functions._

// Group rows by keyword, then collect every url for that key into an array column
val df = myDataset.groupBy("keyword").agg(collect_list("url"))

Using agg() together with groupBy() lets you do what you need; inside agg() you have access to aggregate functions such as collect_set(), collect_list(), sum(), etc.
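Note that collect_list produces an array column, while the desired output above shows a single comma-separated string. A minimal sketch of getting that exact shape, assuming the same myDataset as above, is to wrap the collected list in concat_ws:

import org.apache.spark.sql.functions._

// Join the collected urls into one comma-separated string per keyword.
// collect_list gives no ordering guarantee, so the url order may vary between runs.
val joined = myDataset
  .groupBy("keyword")
  .agg(concat_ws(",", collect_list("url")).as("url"))

joined.show(truncate = false)

As a side benefit, concat_ws skips null elements, which is usually what you want here.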

In addition to the accepted answer, if you wish to do the same thing in a lambda (RDD-based) way:

baseDS.rdd
  .filter { x => !x.getAs[String](0).contains("keyword") } // drop the header row
  .map { x => (x.get(0), x.get(1)) }                       // pair up (keyword, url)
  .groupByKey()                                            // group all urls per keyword
  .foreach(println(_))

Note: the filter() step, which drops the header row, can be skipped if the data is read with an explicit schema definition.
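For illustration, a minimal sketch of that schema-based read, assuming the data comes from a CSV file at a hypothetical path data.csv and an existing SparkSession named spark:

import org.apache.spark.sql.types._

// Hypothetical schema matching the keyword/url columns above
val schema = StructType(Seq(
  StructField("keyword", StringType),
  StructField("url", StringType)
))

// With header=true the reader consumes the header line instead of loading it as data,
// so the filter() on "keyword" is no longer needed
val baseDS = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("data.csv")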

Result

(A1,CompactBuffer(url1, url2, url3, url4))
(A2,CompactBuffer(url1, url2, url3))
