How to join between different elements of two PySpark dataframes
I have two dataframes named df1 and df2; their contents are shown below.
df1:
line_item_usage_account_id  line_item_unblended_cost  name
100000000001                12.05                     account1
200000000001                52                        account2
300000000003                12.03                     account3
df2:
accountname  accountproviderid  clustername  app_pmo  app_costcenter  line_item_unblended_cost
account1     100000000001       cluster1     111111   11111111        12.05
account2     200000000001       cluster2     222222   22222222        52
I need the join to also include the IDs from df1.line_item_usage_account_id that are not present in df2.accountproviderid, like this:
accountname  accountproviderid  clustername  app_pmo  app_costcenter  line_item_unblended_cost
account1     100000000001       cluster1     111111   11111111        12.05
account2     200000000001       cluster2     222222   22222222        52
account3     300000000003       NA           NA       NA              12.03
The id "300000000003" from df1.line_item_usage_account_id is not found in df2.accountproviderid, so its row is added to the new dataframe.
Any idea how to achieve this? I'd appreciate any help.
You can use a right join here:
df2.join(df1, df2.accountproviderid == df1.line_item_usage_account_id, "right") \
    .drop("accountname", "accountproviderid") \
    .drop(df2.line_item_unblended_cost) \
    .withColumnRenamed("line_item_usage_account_id", "accountproviderid") \
    .withColumnRenamed("name", "accountname") \
    .select("accountname", "accountproviderid", "clustername",
            "app_pmo", "app_costcenter", "line_item_unblended_cost") \
    .show()

Note that both dataframes contain a line_item_unblended_cost column, so after the join you need to drop df2's copy (via the Column reference df2.line_item_unblended_cost, since dropping by name would be ambiguous) before selecting it by name.
+-----------+-----------------+-----------+-------+--------------+------------------------+
|accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
+-----------+-----------------+-----------+-------+--------------+------------------------+
| account1| 100000000001| cluster1| 111111| 11111111| 12.05|
| account2| 200000000001| cluster2| 222222| 22222222| 52.0|
| account3| 300000000003| null| null| null| 12.03|
+-----------+-----------------+-----------+-------+--------------+------------------------+
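If you want to check the join logic without a Spark session, the same steps can be reproduced locally with pandas (a stand-in for Spark here, not part of the original answer; the data is the sample from the question). This version also fills the missing values with the literal string "NA" that the expected output shows, whereas Spark leaves them as null:

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    "line_item_usage_account_id": ["100000000001", "200000000001", "300000000003"],
    "line_item_unblended_cost": [12.05, 52.0, 12.03],
    "name": ["account1", "account2", "account3"],
})
df2 = pd.DataFrame({
    "accountname": ["account1", "account2"],
    "accountproviderid": ["100000000001", "200000000001"],
    "clustername": ["cluster1", "cluster2"],
    "app_pmo": ["111111", "222222"],
    "app_costcenter": ["11111111", "22222222"],
    "line_item_unblended_cost": [12.05, 52.0],
})

out = (
    # Drop df2's duplicate cost column up front to avoid _x/_y suffixes.
    df2.drop(columns=["line_item_unblended_cost"])
    .merge(
        df1,
        left_on="accountproviderid",
        right_on="line_item_usage_account_id",
        how="right",  # keep every row of df1, matched or not
    )
    # Drop df2's key columns and rename df1's to the target schema.
    .drop(columns=["accountname", "accountproviderid"])
    .rename(columns={
        "line_item_usage_account_id": "accountproviderid",
        "name": "accountname",
    })
    .fillna("NA")  # the question's expected output uses literal "NA"
    [["accountname", "accountproviderid", "clustername",
      "app_pmo", "app_costcenter", "line_item_unblended_cost"]]
)
print(out)
```

With a right join the unmatched account3 row survives, and only its df2-side columns come back empty.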