With Spark 2.x, starting from a Dataset like this one:
|keyword |url  |
|A1      |url1 |
|A1      |url2 |
|A1      |url3 |
|A1      |url4 |
|A2      |url1 |
|A2      |url2 |
|A2      |url3 |
how could I obtain the following?
|keyword |url                 |
|A1      |url1,url2,url3,url4 |
|A2      |url1,url2,url3      |
Try this:
import org.apache.spark.sql.functions._
val df = myDataset.groupBy("keyword").agg(collect_list("url"))
Using agg() with groupBy() lets you do what you need: inside agg() you get access to aggregation functions such as collect_set(), collect_list(), sum(), etc.
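To match the comma-joined strings shown in the question (rather than an array column), the collected list can be flattened with concat_ws. A minimal sketch, assuming a Dataset named myDataset with the keyword/url columns from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("group-urls").getOrCreate()
import spark.implicits._

// Sample data matching the question.
val myDataset = Seq(
  ("A1", "url1"), ("A1", "url2"), ("A1", "url3"), ("A1", "url4"),
  ("A2", "url1"), ("A2", "url2"), ("A2", "url3")
).toDF("keyword", "url")

// collect_list gathers the urls per keyword into an array;
// concat_ws joins that array into a single comma-separated string.
val result = myDataset
  .groupBy("keyword")
  .agg(concat_ws(",", collect_list("url")).as("urls"))

result.show(false)
```

Note that collect_list does not guarantee any particular ordering of the collected values, so the urls may not come back in input order on a multi-partition Dataset.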
In addition to the accepted answer, here is the same thing done the lambda (RDD) way:

baseDS.rdd
  .filter { x => !x.getAs[String](0).contains("keyword") }
  .map { x => (x.get(0), x.get(1)) }
  .groupByKey()
  .foreach(println(_))
Note: the filter() step (which drops the header row containing "keyword") can be skipped if a schema is defined when the data is read.
Result:
(A1,CompactBuffer(url1, url2, url3, url4))
(A2,CompactBuffer(url1, url2, url3))
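The grouping step above can be sketched in plain Scala collections (no Spark needed) to see what groupByKey produces, and how to go from the grouped values to the comma-joined strings the question asks for. The pair data is taken from the question:

```scala
// (keyword, url) pairs from the question.
val pairs = Seq(
  ("A1", "url1"), ("A1", "url2"), ("A1", "url3"), ("A1", "url4"),
  ("A2", "url1"), ("A2", "url2"), ("A2", "url3")
)

// groupBy collects the pairs per keyword (like groupByKey on an RDD);
// mkString then joins each group's urls into one comma-separated string.
val grouped: Map[String, String] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).mkString(",") }

// grouped("A1") == "url1,url2,url3,url4"
// grouped("A2") == "url1,url2,url3"
```

On the actual RDD, the equivalent final step would be a mapValues(_.mkString(",")) after groupByKey().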