PySpark: Create Dataframe Dynamically from Nested Arrays using Array Values as Column Headers
I'm facing a problem and hoping someone can help.
Suppose I have an incoming data stream whose records look like this:
{"headers": ["col_a","col_b","col_c","col_d"], "data": [["0","1","2","3"],["0.2","0.1","3","4"],["5","4","3","2"]]}
{"headers": ["col_a","col_b","col_c","col_d"], "data": [["0.1","1.2","2.5","3"],["0","0","1","0"]]}
...
Now further assume the data has already been cleaned.
Is there a way in PySpark to turn the records above into a dataframe like this?
col_a | col_b | col_c | col_d |
---|---|---|---|
0 | 1 | 2 | 3 |
0.2 | 0.1 | 3 | 4 |
5 | 4 | 3 | 2 |
0.1 | 1.2 | 2.5 | 3 |
0 | 0 | 1 | 0 |
Any comments and/or working code would be greatly appreciated.
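One detail worth noting before reproducing the answers below: `spark.read.json` reads JSON Lines by default (one JSON object per line), so the sample records can be written straight to the `map.json` file the answers read. A minimal sketch (the filename matches the one used in the answers; the records are the cleaned samples above):

```python
import json

# Write the sample records as JSON Lines (one object per line),
# the default format that spark.read.json expects.
records = [
    {"headers": ["col_a", "col_b", "col_c", "col_d"],
     "data": [["0", "1", "2", "3"], ["0.2", "0.1", "3", "4"], ["5", "4", "3", "2"]]},
    {"headers": ["col_a", "col_b", "col_c", "col_d"],
     "data": [["0.1", "1.2", "2.5", "3"], ["0", "0", "1", "0"]]},
]
with open("map.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```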
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("map.json")
# collect the headers as a list
headers = df.select(F.explode("headers").alias("headers")).distinct().collect()
headers = [r.headers for r in headers]
# explode data arrays so that it has the same dimensions as the header array
df = df.select(F.explode("data").alias("data"), "headers")
# zip data and headers together to form a map
df = df.select(F.arrays_zip("headers", "data").alias("map"))
df = df.select(F.map_from_entries("map").alias("map"))
# select out your headers from the map to form columns
df = df.select(*[df.map.getItem(col).alias(col) for col in headers])
df.show(truncate=False)
Output:
+-----+-----+-----+-----+
|col_a|col_b|col_c|col_d|
+-----+-----+-----+-----+
|0 |1 |2 |3 |
|0.2 |0.1 |3 |4 |
|5 |4 |3 |2 |
+-----+-----+-----+-----+
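The core of this approach — `F.arrays_zip` followed by `F.map_from_entries` — pairs each data row with the headers to form a column-name-to-value mapping. The same pairing can be sketched in plain Python for intuition (using the first sample record above):

```python
# Pure-Python sketch of what arrays_zip + map_from_entries do per row:
# zip each data row with the headers to get a column -> value mapping.
headers = ["col_a", "col_b", "col_c", "col_d"]
data = [["0", "1", "2", "3"], ["0.2", "0.1", "3", "4"], ["5", "4", "3", "2"]]

rows = [dict(zip(headers, row)) for row in data]
print(rows[0])  # {'col_a': '0', 'col_b': '1', 'col_c': '2', 'col_d': '3'}
```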
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("map.json")
# explode data arrays so that it has the same dimensions as the header array
df = df.select(F.explode("data").alias("data"), "headers")
# zip data and headers together and explode it into rows
df = df.select(F.arrays_zip("headers", "data").alias("zipped"))
df = df.select(F.explode("zipped").alias("exploded_struct"))
df = df.selectExpr("exploded_struct.*")
# Add a single index so that we can group by it and then pivot our headers into columns.
# NB: this groups all of our data into a single partition
df = df.withColumn("idx", F.lit(1))
df = df.groupBy("idx").pivot("headers").agg(F.collect_list("data").alias("data")).drop("idx")
# each column now contains its array of data; in order to explode them
# we need to zip all of the columns together and explode in a single operation
df = df.withColumn("zipped",F.arrays_zip(*df.columns))
df = df.select(F.explode("zipped").alias("exploded_struct"))
# Finally we select out our headers from the exploded_struct
df = df.selectExpr("exploded_struct.*")
df.show(truncate=False)
Output:
+-----+-----+-----+-----+
|col_a|col_b|col_c|col_d|
+-----+-----+-----+-----+
|0 |1 |2 |3 |
|0.2 |0.1 |3 |4 |
|5 |4 |3 |2 |
+-----+-----+-----+-----+
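The reassembly step in this second approach can also be sketched in plain Python: after the pivot, each column holds the list of its values (`collect_list`), and zipping the columns back together row-wise reconstructs the original rows. A minimal sketch with the values from the first sample record:

```python
# Pure-Python sketch of the post-pivot regrouping: each column holds a
# list of its values; zipping the columns row-wise rebuilds the rows.
columns = {
    "col_a": ["0", "0.2", "5"],
    "col_b": ["1", "0.1", "4"],
    "col_c": ["2", "3", "3"],
    "col_d": ["3", "4", "2"],
}
names = list(columns)
rows = [dict(zip(names, vals)) for vals in zip(*columns.values())]
print(rows[1])  # {'col_a': '0.2', 'col_b': '0.1', 'col_c': '3', 'col_d': '4'}
```

Note that, unlike `arrays_zip`, a plain `zip` silently truncates to the shortest column, so this sketch assumes all columns have equal length.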