PySpark: join data frames
I have two Spark data frames.
df1
id product price
0 x 100
1 y 120
2 z 110
3 x 150
4 x 100
and df2
id unique_products
0 x
1 y
2 z
How can I get this result:
id unique_products prices
0 x [100, 150, 100]
1 y [120]
2 z [110]
You can group df1 by product and apply collect_list on price. Then join the result with df2 to get the id.
from pyspark.sql import SparkSession, functions as F

# Reuse the active session if one exists (e.g. in the pyspark shell).
spark = SparkSession.builder.getOrCreate()

data1 = [(0, "x", 100),
         (1, "y", 120),
         (2, "z", 110),
         (3, "x", 150),
         (4, "x", 100)]
data2 = [(0, "x"), (1, "y"), (2, "z")]
df1 = spark.createDataFrame(data1, ("id", "product", "price"))
df2 = spark.createDataFrame(data2, ("id", "unique_products"))

# Collect all prices per product and rename the key column to match df2.
df_prices = (df1.groupBy("product")
             .agg(F.collect_list("price").alias("prices"))
             .selectExpr("product as unique_products", "prices"))

# Join on the shared column to pick up the id from df2.
df2.join(df_prices, ["unique_products"]).select("id", "unique_products", "prices").show()
+---+---------------+---------------+
| id|unique_products|         prices|
+---+---------------+---------------+
|  0|              x|[100, 150, 100]|
|  1|              y|          [120]|
|  2|              z|          [110]|
+---+---------------+---------------+
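To make the group-then-join logic easier to follow, here is a minimal plain-Python sketch of the same two steps, with no Spark involved. The helper names collect_prices and join_prices are hypothetical and exist only for this illustration; they are not part of PySpark.

```python
from collections import defaultdict

# Same sample rows as in the question: (id, product, price) and (id, unique_products).
df1_rows = [(0, "x", 100), (1, "y", 120), (2, "z", 110), (3, "x", 150), (4, "x", 100)]
df2_rows = [(0, "x"), (1, "y"), (2, "z")]


def collect_prices(rows):
    """Group rows by product and gather prices into a list (hypothetical helper,
    mirroring groupBy("product").agg(collect_list("price")))."""
    grouped = defaultdict(list)
    for _row_id, product, price in rows:
        grouped[product].append(price)
    return dict(grouped)


def join_prices(unique_rows, prices_by_product):
    """Inner join on the product name, keeping the id from unique_rows
    (hypothetical helper, mirroring df2.join(df_prices, ["unique_products"]))."""
    return [
        (row_id, product, prices_by_product[product])
        for row_id, product in unique_rows
        if product in prices_by_product
    ]


result = join_prices(df2_rows, collect_prices(df1_rows))
for row in result:
    print(row)
```

Note that this sketch keeps prices in input order, whereas Spark's collect_list does not guarantee the order of collected values after a shuffle, and the row order of the joined result is not guaranteed either. If a stable order matters, sort explicitly.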