Pyspark: match columns from two different dataframes and add value

I am trying to compare the values of two columns that exist in different dataframes, in order to create a new dataframe based on whether the values match:

df1 =

| id |
| -- |
| 1  |
| 2  |
| 3  |
| 4  | 
| 5  |

df2 =

| id |
| -- |
| 2  |
| 5  |
| 1  |

So, I want to put an 'X' in the is_used field when an id from df1 also exists in df2, and 'NA' otherwise, to generate a result dataframe like this:

df3 =

| id | is_used |
| -- | ------- |
| 1  |    X    |
| 2  |    X    |
| 3  |    NA   |
| 4  |    NA   |
| 5  |    X    |
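
For reference, the sample dataframes can be built like this (a minimal sketch, assuming a local SparkSession; the variable and column names match the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Recreate the sample data from the question
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,)], ["id"])
df2 = spark.createDataFrame([(2,), (5,), (1,)], ["id"])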

I have tried it this way, but the condition places an "X" in every row:

df3 = df3.withColumn('is_used', F.when(
    condition = (F.arrays_overlap(F.array(df1.id), F.array(df2.id))) == False,
    value = 'NA'
).otherwise('X'))

I would appreciate any help.

Try the following code; it will give you a similar result, and you can make the rest of the changes yourself:


df3 = df1.alias("df1").\
    join(df2.alias("df2"), (df1.id==df2.id), how='left').\
    withColumn('is_true', F.when(df1.id == df2.id,F.lit("X")).otherwise(F.lit("NA"))).\ 
    select("df1.*","is_true")

df3.show()

Try with a fullouter join:

df3 = (
    df1.join(df2.alias("df2"), df1.id == df2.id, "fullouter")
    .withColumn(
        "is_used",
        F.when(F.col("df2.id").isNotNull(), F.lit("X")).otherwise(F.lit("NA")),
    )
    .drop(F.col("df2.id"))
    .orderBy(F.col("id"))
)

Result:

+---+-------+                                                                   
|id |is_used|
+---+-------+
|1  |X      |
|2  |X      |
|3  |NA     |
|4  |NA     |
|5  |X      |
+---+-------+
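
Note that with this sample data, where every id in df2 also appears in df1, a plain left join from df1 would produce the same table; the fullouter join additionally keeps any ids that exist only in df2.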

First of all, I want to thank the people who contributed their code; it was very useful for understanding what was happening.

The problem was that when evaluating df1.id == df2.id, Spark resolved both columns as the same one because they had the same name, so the comparison always came out True.

I just renamed the field I wanted to compare, and it worked perfectly for me.

Here is the code:

df2 = df2.withColumnRenamed("id", "id1")

df3 = df1.alias("df1").join(df2.alias("df2"),
                            (df1.id == df2.id1), "left")

df3 = df3.withColumn("is_used", F.when(df1.id == df2.id1), 
                     "X").otherwise("NA")

df3 = df3.drop("id1")
