
What's the best way to group and aggregate an array of objects in a DataFrame in Scala

An example:
_4 is a collection of count, date and tag that I want to group and sum.

|_1 |_2   |_3|_4                                                            |
|100|Scrap|12|{[{1, 2022-12-05, A}, {1, 2022-12-05, B}]}                    |
|100|Scrap|12|{[{1, 2022-12-06, A}]}                                        |
|100|Scrap|15|{[{2, 2022-12-07, A}, {2, 2022-12-02, A}, {2, 2022-12-03, C}]}|
|100|Scrap|15|{[{5, 2022-12-05, A}, {3, 2022-12-05, A}, {5, 2022-12-05, D}]}|

The output I'm hoping for is something like this, which groups by the first 3 columns and the third element (tag) in the objects while summing the first element (count).

|UID |Title|Cell|Data                  |
|100 |Scrap|12  |{[{2,A},{1,B}]}       |
|100 |Scrap|15  |{[{12,A},{2,C},{5,D}]}|

The schema of the dataframe looks like this:

 |-- _1: long (nullable = false)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = false)
 |-- _4: struct (nullable = true)
 |    |-- data: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- count: integer (nullable = false)
 |    |    |    |-- date: date (nullable = true)
 |    |    |    |-- tag: string (nullable = true)

A straightforward approach would be to flatten the array content of column _4 via inline, followed by a couple of groupBy/agg as shown below:

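import java.sql.Date
import org.apache.spark.sql.functions._  // expr, struct, sum, first, collect_list
import spark.implicits._                 // toDF and $"..."; assumes a SparkSession named spark (as in spark-shell)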
case class Item(count: Int, date: Date, tag: String)
case class Items(data: Seq[Item])

val df = Seq(
  (100L, "Scrap", 12L, Items(Seq(Item(1, Date.valueOf("2022-12-05"), "A"), Item(1, Date.valueOf("2022-12-05"), "B")))),
  (100L, "Scrap", 12L, Items(Seq(Item(1, Date.valueOf("2022-12-06"), "A")))),
  (100L, "Scrap", 15L, Items(Seq(Item(2, Date.valueOf("2022-12-07"), "A"), Item(2, Date.valueOf("2022-12-02"), "A"), Item(2, Date.valueOf("2022-12-03"), "C")))),
  (100L, "Scrap", 15L, Items(Seq(Item(5, Date.valueOf("2022-12-05"), "A"), Item(3, Date.valueOf("2022-12-05"), "A"), Item(5, Date.valueOf("2022-12-05"), "D"))))
).toDF("_1", "_2", "_3", "_4")

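df.
  // inline turns the array of structs into one row per element, with columns count, date, tag
  select($"_1", $"_2", $"_3", expr("inline(_4.data)")).
  // 1st groupBy: sum count per key columns plus tag
  groupBy($"_1".as("UID"), $"_2".as("Title"), $"_3".as("Cell"), $"tag").agg(
    struct(sum($"count"), first($"tag")).as("TagSum")
  ).
  // 2nd groupBy: collect the per-tag sums into one array per key
  groupBy("UID", "Title", "Cell").agg(
    collect_list("TagSum").as("Data")
  ).
  show(false)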
/*
+---+-----+----+-------------------------+
|UID|Title|Cell|Data                     |
+---+-----+----+-------------------------+
|100|Scrap|12  |[{1, B}, {2, A}]         |
|100|Scrap|15  |[{2, C}, {12, A}, {5, D}]|
+---+-----+----+-------------------------+
*/
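
Note that the order of the elements inside Data above is not guaranteed, because collect_list keeps whatever order the rows arrive in after the shuffle. If a stable order matters, one possible variation (a sketch, not part of the original answer) is to put tag first in the struct and wrap the collected list in sort_array, which sorts structs field by field. The local SparkSession, the val names d and sorted, and the single placeholder date are assumptions made only to keep the sketch self-contained; in spark-shell the existing spark session would be used instead.

```scala
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Item(count: Int, date: Date, tag: String)
case class Items(data: Seq[Item])

// Local session for illustration only (assumption); spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("sorted-tag-sums").getOrCreate()
import spark.implicits._

// One placeholder date everywhere: dates play no part in the aggregation.
val d = Date.valueOf("2022-12-05")
val df = Seq(
  (100L, "Scrap", 12L, Items(Seq(Item(1, d, "A"), Item(1, d, "B")))),
  (100L, "Scrap", 12L, Items(Seq(Item(1, d, "A")))),
  (100L, "Scrap", 15L, Items(Seq(Item(2, d, "A"), Item(2, d, "A"), Item(2, d, "C")))),
  (100L, "Scrap", 15L, Items(Seq(Item(5, d, "A"), Item(3, d, "A"), Item(5, d, "D"))))
).toDF("_1", "_2", "_3", "_4")

val sorted = df.
  select($"_1", $"_2", $"_3", expr("inline(_4.data)")).
  // Sum count per key columns plus tag, keeping tag as a plain column.
  groupBy($"_1".as("UID"), $"_2".as("Title"), $"_3".as("Cell"), $"tag").
  agg(sum($"count").as("count")).
  // tag comes first in the struct so that sort_array orders the elements by tag.
  groupBy("UID", "Title", "Cell").
  agg(sort_array(collect_list(struct($"tag", $"count"))).as("Data")).
  orderBy("Cell")

sorted.show(false)
// Data for Cell 12 holds (A, 2), (B, 1); for Cell 15 it holds (A, 12), (C, 2), (D, 5).
```

The trade-off is that the struct fields come out as (tag, count) rather than (count, tag); the fields also get proper names instead of the generated ones in the version above.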

The 1st groupBy groups the dataset by the key columns along with the struct field tag of the _4.data elements to sum count by tag, and the 2nd groupBy groups only by the key columns to collect the per-tag sums into the wanted result.
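
To make the two-stage logic easier to follow outside of Spark, the same steps can be mirrored on plain Scala collections. This is only an illustrative sketch: the names Entry, Rec, flat, perTag and perKey are made up for this example, dates are left out because they do not affect the result, and the final sortBy is added purely to get a stable order.

```scala
// Plain Scala collections, no Spark required.
final case class Entry(count: Int, tag: String)
final case class Rec(uid: Long, title: String, cell: Long, data: Seq[Entry])

val recs = Seq(
  Rec(100L, "Scrap", 12L, Seq(Entry(1, "A"), Entry(1, "B"))),
  Rec(100L, "Scrap", 12L, Seq(Entry(1, "A"))),
  Rec(100L, "Scrap", 15L, Seq(Entry(2, "A"), Entry(2, "A"), Entry(2, "C"))),
  Rec(100L, "Scrap", 15L, Seq(Entry(5, "A"), Entry(3, "A"), Entry(5, "D")))
)

// Flatten (the role of inline): one row per array element, keys repeated on every row.
val flat: Seq[(Long, String, Long, String, Int)] =
  for (r <- recs; e <- r.data) yield (r.uid, r.title, r.cell, e.tag, e.count)

// First grouping: sum the counts per (uid, title, cell, tag).
val perTag: Map[(Long, String, Long, String), Int] =
  flat
    .groupBy { case (u, t, c, tag, _) => (u, t, c, tag) }
    .map { case (k, rows) => k -> rows.map(_._5).sum }

// Second grouping: collect the (sum, tag) pairs per (uid, title, cell), sorted by tag.
val perKey: Map[(Long, String, Long), Seq[(Int, String)]] =
  perTag.toSeq
    .groupBy { case ((u, t, c, _), _) => (u, t, c) }
    .map { case (k, rows) =>
      k -> rows.map { case ((_, _, _, tag), n) => (n, tag) }.sortBy(_._2)
    }

perKey.toSeq.sortBy(_._1._3).foreach(println)
```

Here perKey((100L, "Scrap", 12L)) evaluates to the pairs (2, A), (1, B), and perKey((100L, "Scrap", 15L)) to (12, A), (2, C), (5, D), matching the output requested in the question.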
