Pyspark Join data frame

I have two Spark data frames.

df1

id    product  price
0     x        100
1     y        120
2     z        110
3     x        150
4     x        100

and df2

id    unique_products 
0     x        
1     y        
2     z         

How can I get this result:

id    unique_products  prices
0     x                [100, 150, 100]                      
1     y                [120]
2     z                [110]

You can group df1 by product, apply collect_list to price, and then join the result with df2 to get the id.

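from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session if one exists (e.g. in a notebook or shell), else create one.
spark = SparkSession.builder.getOrCreate()

data1 = [(0, "x", 100),
         (1, "y", 120),
         (2, "z", 110),
         (3, "x", 150),
         (4, "x", 100)]

data2 = [(0, "x"), (1, "y"), (2, "z")]

df1 = spark.createDataFrame(data1, ("id", "product", "price"))
df2 = spark.createDataFrame(data2, ("id", "unique_products"))

# Collect all prices per product, then rename the key to match df2's column.
df_prices = (
    df1.groupBy("product")
    .agg(F.collect_list("price").alias("prices"))
    .selectExpr("product as unique_products", "prices")
)

# Join on the shared column to pick up the id from df2.
df2.join(df_prices, ["unique_products"]).select("id", "unique_products", "prices").show()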

Output

+---+---------------+---------------+
| id|unique_products|         prices|
+---+---------------+---------------+
|  0|              x|[100, 150, 100]|
|  1|              y|          [120]|
|  2|              z|          [110]|
+---+---------------+---------------+
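
To check the logic without a Spark session, the same two steps (group and collect, then join) can be mimicked in plain Python. This is only a sketch of the semantics, not how Spark executes it; the names `df1_rows`, `df2_rows`, `prices_by_product` and `result` are made up for the illustration.

```python
from collections import defaultdict

# Same sample rows as df1 and df2 above, as plain tuples (illustrative names).
df1_rows = [(0, "x", 100), (1, "y", 120), (2, "z", 110), (3, "x", 150), (4, "x", 100)]
df2_rows = [(0, "x"), (1, "y"), (2, "z")]

# Step 1: group by product and collect the prices (the collect_list step).
prices_by_product = defaultdict(list)
for _row_id, product, price in df1_rows:
    prices_by_product[product].append(price)

# Step 2: inner join on the product name to pick up the id from df2.
result = [
    (uid, product, prices_by_product[product])
    for uid, product in df2_rows
    if product in prices_by_product
]

for row in result:
    print(row)
```

One difference worth keeping in mind: the plain-Python version keeps prices in input order, whereas Spark documents collect_list as non-deterministic in ordering after a shuffle, so the order inside each prices array should not be relied on.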
