Pyspark: join dataframe as an array type column to another dataframe
I am trying to join two dataframes in pyspark, but join one table to the other as an array type column.
For example, for these tables:
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(a=1, b='C', c=26, d='abc'),
    Row(a=1, b='C', c=27, d='def'),
    Row(a=1, b='D', c=51, d='ghi'),
    Row(a=2, b='C', c=40, d='abc'),
    Row(a=2, b='D', c=45, d='abc'),
    Row(a=2, b='D', c=38, d='def')
])
df2 = spark.createDataFrame([
    Row(a=1, b='C', e=2, f='cba'),
    Row(a=1, b='D', e=3, f='ihg'),
    Row(a=2, b='C', e=7, f='cba'),
    Row(a=2, b='D', e=9, f='cba')
])
I want to join df1 to df2 on the a and b columns, but df1.c and df1.d should become a single array type column. Also, all column names should be kept. The output of the new dataframe should be convertible to this JSON structure (example for the first two rows):
{
    "a": 1,
    "b": "C",
    "e": 2,
    "f": "cba",
    "df1": [
        {
            "c": 26,
            "d": "abc"
        },
        {
            "c": 27,
            "d": "def"
        }
    ]
}
Any ideas on how to achieve this would be much appreciated!
Thanks,
Carolina
Based on your input sample data:
from pyspark.sql import functions as F
df1 = df1.groupBy("a", "b").agg(
    F.collect_list(F.struct(F.col("c"), F.col("d"))).alias("df1")
)
df1.show()
+---+---+--------------------+
| a| b| df1|
+---+---+--------------------+
| 1| C|[[26, abc], [27, ...|
| 1| D| [[51, ghi]]|
| 2| D|[[45, abc], [38, ...|
| 2| C| [[40, abc]]|
+---+---+--------------------+
df3 = df1.join(df2, on=["a", "b"])
df3.show()
+---+---+--------------------+---+---+
| a| b| df1| e| f|
+---+---+--------------------+---+---+
| 1| C|[[26, abc], [27, ...| 2|cba|
| 1| D| [[51, ghi]]| 3|ihg|
| 2| D|[[45, abc], [38, ...| 9|cba|
| 2| C| [[40, abc]]| 7|cba|
+---+---+--------------------+---+---+
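From there, df3.toJSON() (or df3.write.json(...)) produces exactly the JSON shape asked for in the question. The group-then-join logic above can also be verified in plain Python without a Spark session; this is a minimal sketch mirroring the sample data, not Spark code:

```python
from collections import defaultdict

# Sample rows mirroring df1 and df2 from the question.
df1_rows = [
    {"a": 1, "b": "C", "c": 26, "d": "abc"},
    {"a": 1, "b": "C", "c": 27, "d": "def"},
    {"a": 1, "b": "D", "c": 51, "d": "ghi"},
    {"a": 2, "b": "C", "c": 40, "d": "abc"},
    {"a": 2, "b": "D", "c": 45, "d": "abc"},
    {"a": 2, "b": "D", "c": 38, "d": "def"},
]
df2_rows = [
    {"a": 1, "b": "C", "e": 2, "f": "cba"},
    {"a": 1, "b": "D", "e": 3, "f": "ihg"},
    {"a": 2, "b": "C", "e": 7, "f": "cba"},
    {"a": 2, "b": "D", "e": 9, "f": "cba"},
]

# Group df1 by (a, b), collecting {c, d} structs into a list --
# the analogue of groupBy("a", "b") + collect_list(struct("c", "d")).
grouped = defaultdict(list)
for row in df1_rows:
    grouped[(row["a"], row["b"])].append({"c": row["c"], "d": row["d"]})

# Inner-join the grouped lists onto df2 by the (a, b) key.
joined = [dict(row, df1=grouped[(row["a"], row["b"])]) for row in df2_rows]

print(joined[0])
# {'a': 1, 'b': 'C', 'e': 2, 'f': 'cba',
#  'df1': [{'c': 26, 'd': 'abc'}, {'c': 27, 'd': 'def'}]}
```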