What's the best way to group and aggregate an array of objects in a DataFrame in Scala?
An example:
_4 is a collection of count, date and tag that I want to group and sum.
|_1 |_2 |_3|_4 |
|100|Scrap|12|{[{1, 2022-12-05, A}, {1, 2022-12-05, B}]} |
|100|Scrap|12|{[{1, 2022-12-06, A}]} |
|100|Scrap|15|{[{2, 2022-12-07, A}, {2, 2022-12-02, A}, {2, 2022-12-03, C}]}|
|100|Scrap|15|{[{5, 2022-12-05, A}, {3, 2022-12-05, A}, {5, 2022-12-05, D}]}|
The output I'm hoping for is something like this: grouped by the first 3 columns plus the third element (tag) of each object, with the first element (count) summed per tag.
|UID |Title|Cell|Data |
|100 |Scrap|12 |{[{2,A},{1,B}]} |
|100 |Scrap|15 |{[{12,A},{2,C},{5,D}]}|
The schema of the DataFrame looks like this:
|-- _1: long (nullable = false)
|-- _2: string (nullable = true)
|-- _3: long (nullable = false)
|-- _4: struct (nullable = true)
| |-- data: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- count: integer (nullable = false)
| | | |-- date: date (nullable = true)
| | | |-- tag: string (nullable = true)
A straightforward approach would be to flatten the array content of column _4 via inline, followed by a couple of groupBy/agg steps, as shown below:
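import java.sql.Date
// The next two imports are automatic in spark-shell; a standalone application needs them,
// with `spark` being the active SparkSession.
import org.apache.spark.sql.functions._
import spark.implicits._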
case class Item(count: Int, date: Date, tag: String)
case class Items(data: Seq[Item])
val df = Seq(
(100L, "Scrap", 12L, Items(Seq(Item(1, Date.valueOf("2022-12-05"), "A"), Item(1, Date.valueOf("2022-12-05"), "B")))),
(100L, "Scrap", 12L, Items(Seq(Item(1, Date.valueOf("2022-12-06"), "A")))),
(100L, "Scrap", 15L, Items(Seq(Item(2, Date.valueOf("2022-12-07"), "A"), Item(2, Date.valueOf("2022-12-02"), "A"), Item(2, Date.valueOf("2022-12-03"), "C")))),
(100L, "Scrap", 15L, Items(Seq(Item(5, Date.valueOf("2022-12-05"), "A"), Item(3, Date.valueOf("2022-12-05"), "A"), Item(5, Date.valueOf("2022-12-05"), "D"))))
).toDF("_1", "_2", "_3", "_4")
df.
select($"_1", $"_2", $"_3", expr("inline(_4.data)")).
groupBy($"_1".as("UID"), $"_2".as("Title"), $"_3".as("Cell"), $"tag").agg(
struct(sum($"count"), first($"tag")).as("TagSum")
).
groupBy("UID", "Title", "Cell").agg(
collect_list("TagSum").as("Data")
).
show(false)
/*
+---+-----+----+-------------------------+
|UID|Title|Cell|Data |
+---+-----+----+-------------------------+
|100|Scrap|12 |[{1, B}, {2, A}] |
|100|Scrap|15 |[{2, C}, {12, A}, {5, D}]|
+---+-----+----+-------------------------+
*/
The 1st groupBy groups the dataset by the key columns along with the struct field tag of the _4.data elements, summing count by tag. The 2nd groupBy groups only by the key columns to collect the per-tag sums into the wanted result.
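To make the two grouping steps easier to follow, here is a minimal sketch of the same logic on plain Scala collections, with no Spark involved. The names Entry, Record, TagSum and TagSums.sumByTag are hypothetical and exist only for this illustration, and the date is kept as a String for simplicity since it plays no part in the aggregation.

```scala
// Hypothetical types for illustration only; they are not part of the Spark answer above.
case class Entry(count: Int, date: String, tag: String)
case class Record(uid: Long, title: String, cell: Long, data: Seq[Entry])
case class TagSum(sum: Int, tag: String)

object TagSums {
  type Key = (Long, String, Long)

  def sumByTag(rows: Seq[Record]): Map[Key, Seq[TagSum]] = {
    // Flatten (the role of inline): one tuple per array element, carrying the key columns.
    val flat: Seq[(Key, String, Int)] =
      for (r <- rows; e <- r.data) yield ((r.uid, r.title, r.cell), e.tag, e.count)

    // 1st grouping: sum the counts per (key columns, tag).
    // toSeq comes before map so that entries sharing a key are not collapsed into a Map.
    val perTag: Seq[(Key, TagSum)] = flat
      .groupBy { case (key, tag, _) => (key, tag) }
      .toSeq
      .map { case ((key, tag), recs) => (key, TagSum(recs.map(_._3).sum, tag)) }

    // 2nd grouping: collect the per-tag sums per key, sorted by tag for a stable order.
    perTag
      .groupBy(_._1)
      .map { case (key, sums) => key -> sums.map(_._2).sortBy(_.tag) }
  }
}
```

Fed the four sample rows, the key (100, "Scrap", 12) maps to TagSum(2, "A"), TagSum(1, "B") and the key (100, "Scrap", 15) maps to TagSum(12, "A"), TagSum(2, "C"), TagSum(5, "D"). The explicit sort by tag is a choice made for this sketch; in the Spark version, collect_list does not guarantee the order of the collected elements, which is why the Data column above may list the tags in a different order than the hoped-for output.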