
PySpark: Create Dataframe Dynamically from Nested Arrays using Array Values as Column Headers

I'm running into a problem and hope someone can help.

Suppose I have an incoming data stream whose records look like this:

{"headers": ["col_a","col_b","col_c","col_d"], "data": [["0","1","2","3"],["0.2","0.1","3","4"],["5","4","3","2"]]}

{"headers": ["col_a","col_b","col_c","col_d"], "data": [["0.1","1.2","2.5","3"],["0","0","1","0"]]}

...

Now further assume that the data is sanitized so that:

  1. the "headers" field always contains the same array
  2. the arrays inside the "data" array always have the same length as the headers array

Is there a PySpark way to turn the records above into a dataframe like this?

col_a col_b col_c col_d
0     1     2     3
0.2   0.1   3     4
5     4     3     2
0.1   1.2   2.5   3
0     0     1     0

Any comments and/or working code would be much appreciated.

I found two approaches:

  1. using a map
  2. using a pivot operation; this requires collecting all of the data into a single partition for the grouping operation, so it will fail on large datasets
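Both snippets below read a map.json file. As a minimal setup sketch of my own (the local-mode session and the file-writing scaffolding are assumptions, not part of the original answer), the following writes the first sample record as JSON Lines, which matches the example output shown under each solution:

  import json
  from pyspark.sql import SparkSession

  # scaffolding: start a local session and write the first sample record
  # as a single JSON Lines record that spark.read.json can parse
  spark = SparkSession.builder.master("local[*]").getOrCreate()

  record = {
      "headers": ["col_a", "col_b", "col_c", "col_d"],
      "data": [["0", "1", "2", "3"], ["0.2", "0.1", "3", "4"], ["5", "4", "3", "2"]],
  }
  with open("map.json", "w") as f:
      f.write(json.dumps(record) + "\n")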

Map solution

  from pyspark.sql import functions as F

  df = spark.read.json("map.json")
  
  # collect the headers as a list
  headers = df.select(F.explode("headers").alias("headers")).distinct().collect()
  headers = [r.headers for r in headers]
  
  # explode data arrays so that it has the same dimensions as the header array
  df = df.select(F.explode("data").alias("data"), "headers")
  # zip data and headers together to form a map
  df = df.select(F.arrays_zip("headers", "data").alias("map"))
  df = df.select(F.map_from_entries("map").alias("map"))
  
  # select out your headers from the map to form columns
  df = df.select(*[df.map.getItem(col).alias(col) for col in headers])
  df.show(truncate=False)
  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  +-----+-----+-----+-----+
  |col_a|col_b|col_c|col_d|
  +-----+-----+-----+-----+
  |0    |1    |2    |3    |
  |0.2  |0.1  |3    |4    |
  |5    |4    |3    |2    |
  +-----+-----+-----+-----+
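Note that every column in the result is still a string, because the incoming values are strings. If numeric columns are wanted, a follow-up cast can be appended (a sketch, assuming every value parses as a double):

  # cast each column to double; assumes all values are numeric strings
  df = df.select(*[F.col(c).cast("double").alias(c) for c in headers])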

Pivot solution

  from pyspark.sql import functions as F

  df = spark.read.json("map.json")

  # explode data arrays so that it has the same dimensions as the header array
  df = df.select(F.explode("data").alias("data"), "headers")
  # zip data and headers together and explode it into rows
  df = df.select(F.arrays_zip("headers", "data").alias("zipped"))
  df = df.select(F.explode("zipped").alias("exploded_struct"))
  df = df.selectExpr("exploded_struct.*")
  
  # Add a single index so that we can group by it and then pivot our headers into columns. 
  # **NB This groups all of our data into a single partition
  df = df.withColumn("idx", F.lit(1))
  df = df.groupBy("idx").pivot("headers").agg(F.collect_list("data").alias("data")).drop("idx")
  
  # each column now contains its array of data. In order to explode them we need to zip all of them
  # together and explode in a single operation
  df = df.withColumn("zipped",F.arrays_zip(*df.columns))
  df = df.select(F.explode("zipped").alias("exploded_struct"))
  
  # Finally we select out our headers from the exploded_struct
  df = df.selectExpr("exploded_struct.*")
  df.show(truncate=False)

  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  +-----+-----+-----+-----+
  |col_a|col_b|col_c|col_d|
  +-----+-----+-----+-----+
  |0    |1    |2    |3    |
  |0.2  |0.1  |3    |4    |
  |5    |4    |3    |2    |
  +-----+-----+-----+-----+
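As a sanity check, the two results can be compared row-for-row (a sketch, assuming the map result was kept as df_map and the pivot result as df_pivot; exceptAll needs Spark 2.4+, the same version that introduced arrays_zip and map_from_entries):

  # an empty difference in both directions means both approaches produced the same rows
  assert df_map.exceptAll(df_pivot).count() == 0
  assert df_pivot.exceptAll(df_map).count() == 0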
