Spark Scala groupBy multiple columns with values
I have the following dataframe (`df`) in Spark:

| group_1 | group_2 | year | value |
|---|---|---|---|
| "School1" | "Student" | 2018 | name_aaa |
| "School1" | "Student" | 2018 | name_bbb |
| "School1" | "Student" | 2019 | name_aaa |
| "School2" | "Student" | 2019 | name_aaa |
What I want is:

| group_1 | group_2 | values_map |
|---|---|---|
| "School1" | "Student" | [2018 -> [name_aaa, name_bbb], 2019 -> [name_aaa]] |
| "School2" | "Student" | [2019 -> [name_aaa]] |
I tried with `groupBy`, `collect_list()` and `map()`, but it didn't work: it produces a map holding only the last value, either `name_aaa` or `name_bbb`. How can I achieve this with Apache Spark?
The other answer's result is of array type, not map. Here is how to get a `map`-typed column for your result.
```scala
df.groupBy("group_1", "group_2", "year")
  .agg(collect_list("value").as("value_list"))
  .groupBy("group_1", "group_2")
  .agg(collect_list(struct(col("year"), col("value_list"))).as("map_list"))
  .withColumn("values_map", map_from_entries(col("map_list")))
  .drop("map_list")
  .show(false)
```
I did not use a `udf`. The result then shows directly what you expect:
```
+-------+-------+--------------------------------------------------+
|group_1|group_2|values_map                                        |
+-------+-------+--------------------------------------------------+
|School2|Student|[2019 -> [name_aaa]]                              |
|School1|Student|[2018 -> [name_aaa, name_bbb], 2019 -> [name_aaa]]|
+-------+-------+--------------------------------------------------+
```
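For intuition, the two-level grouping that this answer performs (first collect values per year, then turn the (year, values) entries into a map) can be sketched with plain Scala collections, without Spark. The `Rec` class and all names below are illustrative only:

```scala
// Plain-Scala sketch of the two-level grouping; illustrative, not Spark code.
case class Rec(group1: String, group2: String, year: Int, value: String)

val rows = Seq(
  Rec("School1", "Student", 2018, "name_aaa"),
  Rec("School1", "Student", 2018, "name_bbb"),
  Rec("School1", "Student", 2019, "name_aaa"),
  Rec("School2", "Student", 2019, "name_aaa")
)

// Step 1: group by (group_1, group_2, year) and collect the values.
// Step 2: regroup by (group_1, group_2) and build a year -> values map,
// which is what map_from_entries does on the collected structs.
val valuesMap: Map[(String, String), Map[Int, Seq[String]]] =
  rows.groupBy(r => (r.group1, r.group2, r.year))
      .map { case ((g1, g2, y), rs) => ((g1, g2), y, rs.map(_.value)) }
      .groupBy { case (key, _, _) => key }
      .map { case (key, entries) =>
        key -> entries.map { case (_, y, vs) => y -> vs }.toMap
      }

println(valuesMap(("School1", "Student")))
// a Map with 2018 -> [name_aaa, name_bbb] and 2019 -> [name_aaa]
// (key ordering in the printout may vary)
```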
A solution could be:
```scala
scala> df1.show
+-------+-------+----+--------+
|group_1|group_2|year|   value|
+-------+-------+----+--------+
|school1|student|2018|name_aaa|
|school1|student|2018|name_bbb|
|school1|student|2019|name_aaa|
|school2|student|2019|name_aaa|
+-------+-------+----+--------+

scala> val df2 = df1.groupBy("group_1","group_2","year").agg(collect_list('value).as("value"))
df2: org.apache.spark.sql.DataFrame = [group_1: string, group_2: string ... 2 more fields]

scala> df2.show
+-------+-------+----+--------------------+
|group_1|group_2|year|               value|
+-------+-------+----+--------------------+
|school1|student|2018|[name_aaa, name_bbb]|
|school1|student|2019|          [name_aaa]|
|school2|student|2019|          [name_aaa]|
+-------+-------+----+--------------------+

scala> val myUdf = udf((year: String, values: Seq[String]) => Map(year -> values))
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,MapType(StringType,ArrayType(StringType,true),true),Some(List(StringType, ArrayType(StringType,true))))

scala> val df3 = df2.withColumn("values",myUdf($"year",$"value")).drop("year","value")
df3: org.apache.spark.sql.DataFrame = [group_1: string, group_2: string ... 1 more field]

scala> val df4 = df3.groupBy("group_1","group_2").agg(collect_list("values").as("value_map"))
df4: org.apache.spark.sql.DataFrame = [group_1: string, group_2: string ... 1 more field]

scala> df4.printSchema
root
 |-- group_1: string (nullable = true)
 |-- group_2: string (nullable = true)
 |-- value_map: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: array (valueContainsNull = true)
 |    |    |    |-- element: string (containsNull = true)

scala> df4.show(false)
+-------+-------+------------------------------------------------------+
|group_1|group_2|value_map                                             |
+-------+-------+------------------------------------------------------+
|school1|student|[[2018 -> [name_aaa, name_bbb]], [2019 -> [name_aaa]]]|
|school2|student|[[2019 -> [name_aaa]]]                                |
+-------+-------+------------------------------------------------------+
```
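Note that `value_map` here is an array of single-entry maps, not the single map the question asked for. One possible final step, in the same `udf` style as above, is to fold the array into one map; the merge logic itself is plain Scala (the `mergeMaps` name and the `df5` line are illustrative, assuming the `df4` from this session):

```scala
// Collapse an array of single-entry maps into one map. Years are distinct
// within a group after the first groupBy, so ++ never overwrites an entry.
def mergeMaps(maps: Seq[Map[String, Seq[String]]]): Map[String, Seq[String]] =
  maps.foldLeft(Map.empty[String, Seq[String]])(_ ++ _)

val merged = mergeMaps(Seq(
  Map("2018" -> Seq("name_aaa", "name_bbb")),
  Map("2019" -> Seq("name_aaa"))
))
// In Spark this would be wrapped as a udf, e.g. (illustrative):
//   val df5 = df4.withColumn("values_map", udf(mergeMaps _)($"value_map"))
//                .drop("value_map")
```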
Let me know if it helps!