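PySpark - Convert column of Lists to Rows
I have a PySpark DataFrame. I need to group it and then aggregate certain columns into a list so that I can apply a UDF to the DataFrame.
For example, I created a DataFrame and then grouped it by person.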
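import pyspark.sql.functions as F
# a holds the input rows (its definition is not shown)
df = spark.createDataFrame(a, ["Person", "Amount", "Budget", "Date"])
# collect each person's rows into a single list of structs
df = df.groupby("Person").agg(F.collect_list(F.struct("Amount", "Budget", "Date")).alias("data"))
df.show(truncate=False)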
+------+----------------------------------------------------------------------------+
|Person|data |
+------+----------------------------------------------------------------------------+
|Bob |[[85.8,Food,2017-09-13], [7.8,Household,2017-09-13], [6.52,Food,2017-06-13]]|
+------+----------------------------------------------------------------------------+
I have left out the UDF, but the DataFrame it produces looks like this.
+------+--------------------------------------------------------------+
|Person|res |
+------+--------------------------------------------------------------+
|Bob |[[562,Food,June,1], [380,Household,Sept,4], [880,Food,Sept,2]]|
+------+--------------------------------------------------------------+
I need to convert the resulting DataFrame into rows, so that each element of the list becomes a new row with new columns, as shown here.
+------+------+---------+-----+-------+
|Person|Amount|Budget   |Month|Cluster|
+------+------+---------+-----+-------+
|Bob   |562   |Food     |June |1      |
|Bob   |380   |Household|Sept |4      |
|Bob   |880   |Food     |Sept |2      |
+------+------+---------+-----+-------+
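To make the target shape concrete, here is a plain-Python sketch (no Spark involved) of the transformation being asked for: each element of a person's list becomes its own row, and the element's fields become columns. The function name `explode_rows` and the dict-based row layout are illustrative assumptions, not part of any PySpark API.

```python
from typing import Any, Dict, List, Tuple

# Column names for the fields inside each list element, taken from the target table.
FIELDS = ("Amount", "Budget", "Month", "Cluster")


def explode_rows(rows: List[Tuple[str, List[tuple]]]) -> List[Dict[str, Any]]:
    """Turn (person, [element, ...]) pairs into one flat dict per element."""
    flat = []
    for person, items in rows:
        for item in items:
            record = {"Person": person}
            record.update(zip(FIELDS, item))  # pair each field name with its value
            flat.append(record)
    return flat


# Same values as the UDF result shown in the question.
grouped = [
    ("Bob", [(562, "Food", "June", 1), (380, "Household", "Sept", 4), (880, "Food", "Sept", 2)]),
]

for row in explode_rows(grouped):
    print(row)
```

This only models the row-per-element behaviour in ordinary Python; on a real DataFrame the work should stay inside Spark.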
You can use explode and getItem, as follows:
# starting from this form:
+------+--------------------------------------------------------------+
|Person|res |
+------+--------------------------------------------------------------+
|Bob |[[562,Food,June,1], [380,Household,Sept,4], [880,Food,Sept,2]]|
+------+--------------------------------------------------------------+
import pyspark.sql.functions as F
# explode res to get one row for each item in res
exploded_df = df.select("*", F.explode("res").alias("exploded_data"))
exploded_df.show(truncate=False)
# then use getItem to pull each field out into its own column
# (by name if the items are structs; by index, e.g. getItem(0), if the items are arrays)
exploded_df = exploded_df.withColumn(
    "Amount",
    F.col("exploded_data").getItem("Amount")
)
exploded_df = exploded_df.withColumn(
    "Budget",
    F.col("exploded_data").getItem("Budget")
)
exploded_df = exploded_df.withColumn(
    "Month",
    F.col("exploded_data").getItem("Month")
)
exploded_df = exploded_df.withColumn(
    "Cluster",
    F.col("exploded_data").getItem("Cluster")
)
exploded_df.select("Person", "Amount", "Budget", "Month", "Cluster").show(10, False)
+------+------+---------+-----+-------+
|Person|Amount|Budget   |Month|Cluster|
+------+------+---------+-----+-------+
|Bob   |562   |Food     |June |1      |
|Bob   |380   |Household|Sept |4      |
|Bob   |880   |Food     |Sept |2      |
+------+------+---------+-----+-------+
You can then drop the columns you no longer need. Hope this helps, and good luck!