Pyspark: match columns from two different dataframes and add value
I am trying to compare the values of two columns that exist in different dataframes, to create a new dataframe based on whether the criteria match:
df1 =
| id |
| -- |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
df2 =
| id |
| -- |
| 2 |
| 5 |
| 1 |
So, I want to add an 'X' in the is_used field when the id in df1 also appears in df2, else add 'NA', to generate a result dataframe like this:
df3 =
| id | is_used |
| -- | ------- |
| 1 | X |
| 2 | X |
| 3 | NA |
| 4 | NA |
| 5 | X |
I have tried it this way, but the condition ends up putting an "X" in every row:
df3 = df3.withColumn('is_used', F.when(
condition = (F.arrays_overlap(F.array(df1.id), F.array(df2.id))) == False,
value = 'NA'
).otherwise('X'))
I would appreciate any help.
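In plain Python terms (no Spark, just to make the intent concrete), what I'm after is a simple membership check:

```python
# Plain-Python sketch of the desired result, using the sample ids above.
df1_ids = [1, 2, 3, 4, 5]
df2_ids = [2, 5, 1]

used = set(df2_ids)
df3 = [(i, "X" if i in used else "NA") for i in df1_ids]
print(df3)  # [(1, 'X'), (2, 'X'), (3, 'NA'), (4, 'NA'), (5, 'X')]
```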
Try the following code; it should give you a similar result, and you can adjust the rest as needed:
df3 = df1.alias("df1") \
    .join(df2.alias("df2"), df1.id == df2.id, how="left") \
    .withColumn("is_true", F.when(df1.id == df2.id, F.lit("X")).otherwise(F.lit("NA"))) \
    .select("df1.*", "is_true")
df3.show()
Try with a fullouter join:
df3 = (
df1.join(df2.alias("df2"), df1.id == df2.id, "fullouter")
.withColumn(
"is_used",
F.when(F.col("df2.id").isNotNull(), F.lit("X")).otherwise(F.lit("NA")),
)
.drop(F.col("df2.id"))
.orderBy(F.col("id"))
)
Result:
+---+-------+
|id |is_used|
+---+-------+
|1 |X |
|2 |X |
|3 |NA |
|4 |NA |
|5 |X |
+---+-------+
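For reference, the logic the fullouter join computes can be traced in plain Python: keep every id seen in either frame, and mark the ones present in df2. With the sample data, df2's ids are a subset of df1's, so a left join would give the same rows, but the full-outer version would also surface ids that exist only in df2.

```python
# Plain-Python trace of the full-outer logic, using the sample ids.
df1_ids = {1, 2, 3, 4, 5}
df2_ids = {2, 5, 1}

# Full outer: keep every id from either side, mark membership in df2.
all_ids = sorted(df1_ids | df2_ids)
rows = [(i, "X" if i in df2_ids else "NA") for i in all_ids]
print(rows)  # [(1, 'X'), (2, 'X'), (3, 'NA'), (4, 'NA'), (5, 'X')]
```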
First of all, I want to thank the people who contributed their code; it was very helpful for understanding what was happening.
The problem was that when doing df1.id == df2.id, Spark was resolving both columns to the same one because they share a name, so the comparison always evaluated to True.
Simply renaming the field I wanted to compare fixed it for me.
Here is the code:
df2 = df2.withColumnRenamed("id", "id1")
df3 = df1.alias("df1").join(df2.alias("df2"),
                            df1.id == df2.id1, "left")
df3 = df3.withColumn("is_used", F.when(df1.id == df2.id1, "X").otherwise("NA"))
df3 = df3.drop("id1")
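One detail worth noting in this left-join version: for the unmatched rows (3 and 4), id1 is null after the join, so df1.id == df2.id1 evaluates to null rather than False, and F.when falls through to otherwise("NA") either way. A plain-Python sketch of that fall-through, with None standing in for SQL null and a hypothetical when_otherwise helper mimicking F.when(...).otherwise(...):

```python
# Rows after the left join: (df1.id, df2.id1); None marks the unmatched side.
joined = [(1, 1), (2, 2), (3, None), (4, None), (5, 5)]

def when_otherwise(cond, value, default):
    # Mimics F.when(cond, value).otherwise(default):
    # only a condition that is strictly True picks `value`;
    # False *and* null (None) both fall through to `default`.
    return value if cond is True else default

df3 = [(i, when_otherwise(None if id1 is None else i == id1, "X", "NA"))
       for i, id1 in joined]
print(df3)  # [(1, 'X'), (2, 'X'), (3, 'NA'), (4, 'NA'), (5, 'X')]
```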