How to join between different elements of two PySpark dataframes

I have two dataframes named df1 and df2, with the following contents.

df1:

line_item_usage_account_id  line_item_unblended_cost    name 
100000000001                12.05                       account1
200000000001                52                          account2
300000000003                12.03                       account3

df2:

accountname     accountproviderid   clustername     app_pmo     app_costcenter      line_item_unblended_cost
account1        100000000001        cluster1        111111      11111111            12.05
account2        200000000001        cluster2        222222      22222222            52

I need the join to also include the IDs from df1.line_item_usage_account_id that are not present in df2.accountproviderid, like this:

accountname     accountproviderid   clustername     app_pmo     app_costcenter      line_item_unblended_cost
account1        100000000001        cluster1        111111      11111111            12.05
account2        200000000001        cluster2        222222      22222222            52
account3        300000000003        NA              NA          NA                  12.03

The id "300000000003" from df1.line_item_usage_account_id is not found in df2.accountproviderid, so it is added to the new dataframe.

Any idea how to achieve this? I appreciate any help.

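You can use a right join here. With df2 on the left and df1 on the right, every row of df1 is kept, and the df2 columns are null where no accountproviderid matches.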

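# Drop df2's own cost column first (a no-op if it is absent) so the final select is unambiguous.
# The right join keeps every row of df1; df2 columns are null where no id matches.
df2.drop("line_item_unblended_cost")\
    .join(df1, df2.accountproviderid == df1.line_item_usage_account_id, "right")\
    .drop("accountname", "accountproviderid")\
    .withColumnRenamed("line_item_usage_account_id", "accountproviderid")\
    .withColumnRenamed("name", "accountname")\
    .select("accountname", "accountproviderid", "clustername", "app_pmo",
            "app_costcenter", "line_item_unblended_cost").show()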

+-----------+-----------------+-----------+-------+--------------+------------------------+
|accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
+-----------+-----------------+-----------+-------+--------------+------------------------+
|   account1|     100000000001|   cluster1| 111111|      11111111|                   12.05|
|   account2|     200000000001|   cluster2| 222222|      22222222|                    52.0|
|   account3|     300000000003|       null|   null|          null|                   12.03|
+-----------+-----------------+-----------+-------+--------------+------------------------+
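
To make the right-join semantics concrete, here is a minimal plain-Python sketch (no Spark required) that reproduces the same result on the rows from the question. The helper name right_join_accounts and the "NA" fill value are assumptions made for illustration, and the sketch assumes each accountproviderid appears at most once in df2; a real join would emit one row per match.

```python
# Plain-Python sketch of right-join semantics; row values are copied from the question.
df1_rows = [
    {"line_item_usage_account_id": "100000000001", "line_item_unblended_cost": 12.05, "name": "account1"},
    {"line_item_usage_account_id": "200000000001", "line_item_unblended_cost": 52.0, "name": "account2"},
    {"line_item_usage_account_id": "300000000003", "line_item_unblended_cost": 12.03, "name": "account3"},
]
df2_rows = [
    {"accountname": "account1", "accountproviderid": "100000000001",
     "clustername": "cluster1", "app_pmo": "111111", "app_costcenter": "11111111"},
    {"accountname": "account2", "accountproviderid": "200000000001",
     "clustername": "cluster2", "app_pmo": "222222", "app_costcenter": "22222222"},
]


def right_join_accounts(left_rows, right_rows, fill="NA"):
    """Hypothetical helper: keep every right row, fill left-only columns when no id matches."""
    # Index the left side (df2) by its join key; assumes the key is unique there.
    by_id = {row["accountproviderid"]: row for row in left_rows}
    result = []
    for r in right_rows:
        match = by_id.get(r["line_item_usage_account_id"], {})
        result.append({
            "accountname": r["name"],
            "accountproviderid": r["line_item_usage_account_id"],
            "clustername": match.get("clustername", fill),
            "app_pmo": match.get("app_pmo", fill),
            "app_costcenter": match.get("app_costcenter", fill),
            "line_item_unblended_cost": r["line_item_unblended_cost"],
        })
    return result


joined = right_join_accounts(df2_rows, df1_rows)
for row in joined:
    print(row)
```

The third row keeps account3's id and cost from df1 and gets the fill value for the three df2-only columns, matching the desired output in the question. In PySpark itself, the null cells could probably be replaced in a similar way with DataFrame.fillna("NA") on the string columns, if "NA" is preferred over null.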
