How to convert a Spark DataFrame to a list of structs in Scala
I have a Spark DataFrame with 12 rows and a number of columns, 22 in this case.
I want to convert it into a DataFrame with the following schema:
root
|-- data: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- ast: double (nullable = true)
| | |-- blk: double (nullable = true)
| | |-- dreb: double (nullable = true)
| | |-- fg3_pct: double (nullable = true)
| | |-- fg3a: double (nullable = true)
| | |-- fg3m: double (nullable = true)
| | |-- fg_pct: double (nullable = true)
| | |-- fga: double (nullable = true)
| | |-- fgm: double (nullable = true)
| | |-- ft_pct: double (nullable = true)
| | |-- fta: double (nullable = true)
| | |-- ftm: double (nullable = true)
| | |-- games_played: long (nullable = true)
| | |-- seconds: double (nullable = true)
| | |-- oreb: double (nullable = true)
| | |-- pf: double (nullable = true)
| | |-- player_id: long (nullable = true)
| | |-- pts: double (nullable = true)
| | |-- reb: double (nullable = true)
| | |-- season: long (nullable = true)
| | |-- stl: double (nullable = true)
| | |-- turnover: double (nullable = true)
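where each element of the data field corresponds to a different row of the original DataFrame.
The final goal is to export it to a .json file with the following format: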
{"data": [{row1}, {row2}, ..., {row12}]}
The code I am currently using is:
val best_12_struct = best_12.withColumn("data", array((0 to 11).map(i => struct(col("ast"), col("blk"), col("dreb"), col("fg3_pct"), col("fg3a"),
col("fg3m"), col("fg_pct"), col("fga"), col("fgm"),
col("ft_pct"), col("fta"), col("ftm"), col("games_played"),
col("seconds"), col("oreb"), col("pf"), col("player_id"),
col("pts"), col("reb"), col("season"), col("stl"), col("turnover"))) : _*))
val best_12_data = best_12_struct.select("data")
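However, array((0 to 11).map(...)) copies the same element 12 times into data. As a result, the .json I end up with contains 12 {"data": ...} lines, each holding its own row duplicated 12 times, instead of a single {"data": ...} with 12 elements, each corresponding to one row of the original DataFrame.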
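You get the same row 12 times because withColumn only picks up information from the row currently being processed; it is applied to each row independently and never sees the other rows.
You need to aggregate the rows at the DataFrame level with collect_list, which is an aggregate function, as follows: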
import org.apache.spark.sql.functions._
val best_12_data = best_12
.withColumn("row", struct(col("ast"), col("blk"), col("dreb"), col("fg3_pct"), col("fg3a"), col("fg3m"), col("fg_pct"), col("fga"), col("fgm"), col("ft_pct"), col("fta"), col("ftm"), col("games_played"), col("seconds"), col("oreb"), col("pf"), col("player_id"), col("pts"), col("reb"), col("season"), col("stl"), col("turnover")))
.agg(collect_list(col("row")).as("data"))
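The same aggregation can be sketched on a small stand-in DataFrame, building the struct from every column rather than listing the 22 names by hand. The object name CollectRowsSketch, the helper collectRows, and the three-row sample data are assumptions made for illustration and are not part of the original question; the local[1] session is only there to make the sketch self-contained.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object CollectRowsSketch {

  // Hypothetical helper: packs every column of each row into a struct,
  // then aggregates all structs into a single array column named "data".
  def collectRows(df: DataFrame): DataFrame = {
    val allColumns = df.columns.map(col)
    df.agg(collect_list(struct(allColumns: _*)).as("data"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("collect-rows-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for best_12: three rows and two of the original columns
    val sample = Seq((1L, 10.5), (2L, 7.0), (3L, 3.25)).toDF("player_id", "pts")

    val collected = collectRows(sample)

    // The result has one row whose "data" array holds one struct per input row
    val arraySize = collected.head().getSeq[Row](0).size
    println(s"rows: ${collected.count()}, array elements: $arraySize")

    // toJSON renders each row of the result as one JSON string
    collected.toJSON.collect().foreach(println)

    spark.stop()
  }
}
```

To get a file on disk, collected.coalesce(1).write.json(path) should produce a single part file, though Spark writes it inside a directory at path rather than as a bare .json file. Note also that collect_list does not guarantee the order of the collected elements, so if the row order matters it may be safer to carry an explicit ordering column inside the struct.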