
PySpark: Create Dataframe Dynamically from Nested Arrays using Array Values as Column Headers

I'm running into a problem and hope someone can help.

Suppose I have an incoming data stream whose records look like this:

{"headers": ["col_a","col_b","col_c","col_d"], "data": [["0","1","2","3"],["0.2","0.1","3","4"],["5","4","3","2"]]}

{"headers": ["col_a","col_b","col_c","col_d"], "data": [["0.1","1.2","2.5","3"],["0","0","1","0"]]}

...

Now further assume that the data is sanitized so that:

  1. the "headers" field always contains the same array
  2. the arrays inside the "data" array always have the same length as the headers array

Is there a PySpark way to turn the records above into a dataframe like this?

col_a col_b col_c col_d
0     1     2     3
0.2   0.1   3     4
5     4     3     2
0.1   1.2   2.5   3
0     0     1     0

Any comments and/or working code would be much appreciated.

I found two approaches:

  1. using a map
  2. using a pivot operation; this requires collecting all of the data into a single partition for the grouping operation, so it will fail on large datasets
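Both snippets below read a map.json file. As a minimal setup sketch of my own (the local-mode session and the file-writing scaffolding are assumptions, not part of the original answer), the following writes the first sample record as JSON Lines, which matches the example output shown under each solution:

  import json
  from pyspark.sql import SparkSession

  # scaffolding: start a local session and write the first sample record
  # as a single JSON Lines record that spark.read.json can parse
  spark = SparkSession.builder.master("local[*]").getOrCreate()

  record = {
      "headers": ["col_a", "col_b", "col_c", "col_d"],
      "data": [["0", "1", "2", "3"], ["0.2", "0.1", "3", "4"], ["5", "4", "3", "2"]],
  }
  with open("map.json", "w") as f:
      f.write(json.dumps(record) + "\n")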

Map solution

  from pyspark.sql import functions as F

  df = spark.read.json("map.json")
  
  # collect the headers as a list
  headers = df.select(F.explode("headers").alias("headers")).distinct().collect()
  headers = [r.headers for r in headers]
  
  # explode data arrays so that it has the same dimensions as the header array
  df = df.select(F.explode("data").alias("data"), "headers")
  # zip data and headers together to form a map
  df = df.select(F.arrays_zip("headers", "data").alias("map"))
  df = df.select(F.map_from_entries("map").alias("map"))
  
  # select out your headers from the map to form columns
  df = df.select(*[df.map.getItem(col).alias(col) for col in headers])
  df.show(truncate=False)
  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  +-----+-----+-----+-----+
  |col_a|col_b|col_c|col_d|
  +-----+-----+-----+-----+
  |0    |1    |2    |3    |
  |0.2  |0.1  |3    |4    |
  |5    |4    |3    |2    |
  +-----+-----+-----+-----+
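Note that every column in the result is still a string, because the incoming values are strings. If numeric columns are wanted, a follow-up cast can be appended (a sketch, assuming every value parses as a double):

  # cast each column to double; assumes all values are numeric strings
  df = df.select(*[F.col(c).cast("double").alias(c) for c in headers])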

Pivot solution

  from pyspark.sql import functions as F

  df = spark.read.json("map.json")

  # explode data arrays so that it has the same dimensions as the header array
  df = df.select(F.explode("data").alias("data"), "headers")
  # zip data and headers together and explode it into rows
  df = df.select(F.arrays_zip("headers", "data").alias("zipped"))
  df = df.select(F.explode("zipped").alias("exploded_struct"))
  df = df.selectExpr("exploded_struct.*")
  
  # Add a single index so that we can group by it and then pivot our headers into columns. 
  # **NB This groups all of our data into a single partition
  df = df.withColumn("idx", F.lit(1))
  df = df.groupBy("idx").pivot("headers").agg(F.collect_list("data").alias("data")).drop("idx")
  
  # each column now contains its array of data. In order to explode them we need to zip all of them
  # together and explode in a single operation
  df = df.withColumn("zipped",F.arrays_zip(*df.columns))
  df = df.select(F.explode("zipped").alias("exploded_struct"))
  
  # Finally we select out our headers from the exploded_struct
  df = df.selectExpr("exploded_struct.*")
  df.show(truncate=False)

  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  +-----+-----+-----+-----+
  |col_a|col_b|col_c|col_d|
  +-----+-----+-----+-----+
  |0    |1    |2    |3    |
  |0.2  |0.1  |3    |4    |
  |5    |4    |3    |2    |
  +-----+-----+-----+-----+
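As a sanity check, the two results can be compared row-for-row (a sketch, assuming the map result was kept as df_map and the pivot result as df_pivot; exceptAll needs Spark 2.4+, the same version that introduced arrays_zip and map_from_entries):

  # an empty difference in both directions means both approaches produced the same rows
  assert df_map.exceptAll(df_pivot).count() == 0
  assert df_pivot.exceptAll(df_map).count() == 0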
