
How to use Spark Dataset API to aggregate Key-List?

With Spark 2.x, starting from such a Dataset:

|keyword    |url
|A1         |url1
|A1         |url2
|A1         |url3
|A1         |url4
|A2         |url1
|A2         |url2
|A2         |url3

How could I obtain:

|keyword    |url
|A1         |url1,url2,url3,url4
|A2         |url1,url2,url3

Try this:

import org.apache.spark.sql.functions._

// group rows by keyword and collect every url in the group into an array column
val df = myDataset.groupBy("keyword").agg(collect_list("url"))

Using agg() with groupBy() will let you do what you need; under agg() you get methods such as collect_set(), sum(), etc.
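If you want the exact comma-separated string shown in the question rather than an array column, you can wrap the collected list in concat_ws(). A minimal sketch, reusing the myDataset name and the two columns from the question:

import org.apache.spark.sql.functions._

// collect the urls per keyword, then join them into one comma-separated string;
// note that collect_list gives no ordering guarantee across partitions
val joined = myDataset
  .groupBy("keyword")
  .agg(concat_ws(",", collect_list("url")).as("url"))

joined.show(false)
// |keyword|url                |
// |A1     |url1,url2,url3,url4|
// |A2     |url1,url2,url3     |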

In addition to the accepted answer, if you wish to do the same thing in a lambda way:

    // skip the header row, build (keyword, url) pairs, then group by keyword
    baseDS.rdd
      .filter { x => !x.getAs[String](0).contains("keyword") }
      .map { x => (x.get(0), x.get(1)) }
      .groupByKey()
      .foreach(println(_))

Note: the filter() operation can be skipped if the data is read with a schema definition, since the header row then never appears as data.
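For example, if the input is a CSV file, a sketch of reading it with an explicit schema might look like this (the path and the SparkSession named spark are assumptions, not from the original answer):

import org.apache.spark.sql.types._

// hypothetical CSV source; the explicit schema plus the header option
// means the header row is never part of the data, so no filter() is needed
val schema = StructType(Seq(
  StructField("keyword", StringType, nullable = false),
  StructField("url", StringType, nullable = false)
))

val baseDS = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/keywords.csv")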

Result:

(A1,CompactBuffer(url1, url2, url3, url4))

(A2,CompactBuffer(url1, url2, url3))
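If you prefer to stay in the typed Dataset API instead of dropping to RDDs, the same grouping can be written with groupByKey/mapGroups. A sketch under the assumption that both columns are strings and a SparkSession named spark is in scope:

import spark.implicits._

// typed alternative: group by keyword, then join each group's urls into one string
val result = baseDS
  .map(row => (row.getString(0), row.getString(1)))
  .groupByKey(_._1)
  .mapGroups { (key, rows) => (key, rows.map(_._2).mkString(",")) }
  .toDF("keyword", "url")

result.show(false)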
