How to join between different elements of two PySpark dataframes
I have two dataframes named df1 and df2; their contents are shown below.
df1:
line_item_usage_account_id  line_item_unblended_cost  name
100000000001                12.05                     account1
200000000001                52                        account2
300000000003                12.03                     account3
df2:
accountname  accountproviderid  clustername  app_pmo  app_costcenter  line_item_unblended_cost
account1     100000000001       cluster1     111111   11111111        12.05
account2     200000000001       cluster2     222222   22222222        52
I need the join to also include the IDs from df1.line_item_usage_account_id that are not present in df2.accountproviderid, like this:
accountname  accountproviderid  clustername  app_pmo  app_costcenter  line_item_unblended_cost
account1     100000000001       cluster1     111111   11111111        12.05
account2     200000000001       cluster2     222222   22222222        52
account3     300000000003       NA           NA       NA              12.03
The id "300000000003" from df1.line_item_usage_account_id is not found in df2.accountproviderid, so its row is added to the new dataframe.
Any idea how to achieve this? I'd appreciate any help.
You can use a right join here:
df2.join(df1, df2.accountproviderid == df1.line_item_usage_account_id, "right") \
    .drop("accountname", "accountproviderid") \
    .drop(df2.line_item_unblended_cost) \
    .withColumnRenamed("line_item_usage_account_id", "accountproviderid") \
    .withColumnRenamed("name", "accountname") \
    .select("accountname", "accountproviderid", "clustername",
            "app_pmo", "app_costcenter", "line_item_unblended_cost") \
    .show()

Note that both dataframes contain a line_item_unblended_cost column, so after the join you need to drop df2's copy (via the Column reference df2.line_item_unblended_cost, since dropping by name would be ambiguous) before selecting it by name.
+-----------+-----------------+-----------+-------+--------------+------------------------+
|accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
+-----------+-----------------+-----------+-------+--------------+------------------------+
| account1| 100000000001| cluster1| 111111| 11111111| 12.05|
| account2| 200000000001| cluster2| 222222| 22222222| 52.0|
| account3| 300000000003| null| null| null| 12.03|
+-----------+-----------------+-----------+-------+--------------+------------------------+
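If you want to check the join logic without a Spark session, the same steps can be reproduced locally with pandas (a stand-in for Spark here, not part of the original answer; the data is the sample from the question). This version also fills the missing values with the literal string "NA" that the expected output shows, whereas Spark leaves them as null:

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    "line_item_usage_account_id": ["100000000001", "200000000001", "300000000003"],
    "line_item_unblended_cost": [12.05, 52.0, 12.03],
    "name": ["account1", "account2", "account3"],
})
df2 = pd.DataFrame({
    "accountname": ["account1", "account2"],
    "accountproviderid": ["100000000001", "200000000001"],
    "clustername": ["cluster1", "cluster2"],
    "app_pmo": ["111111", "222222"],
    "app_costcenter": ["11111111", "22222222"],
    "line_item_unblended_cost": [12.05, 52.0],
})

out = (
    # Drop df2's duplicate cost column up front to avoid _x/_y suffixes.
    df2.drop(columns=["line_item_unblended_cost"])
    .merge(
        df1,
        left_on="accountproviderid",
        right_on="line_item_usage_account_id",
        how="right",  # keep every row of df1, matched or not
    )
    # Drop df2's key columns and rename df1's to the target schema.
    .drop(columns=["accountname", "accountproviderid"])
    .rename(columns={
        "line_item_usage_account_id": "accountproviderid",
        "name": "accountname",
    })
    .fillna("NA")  # the question's expected output uses literal "NA"
    [["accountname", "accountproviderid", "clustername",
      "app_pmo", "app_costcenter", "line_item_unblended_cost"]]
)
print(out)
```

With a right join the unmatched account3 row survives, and only its df2-side columns come back empty.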