Check if PySpark column values exist in another dataframe's column values
I'm trying to figure out the condition to check whether the values of one PySpark dataframe exist in another PySpark dataframe and, if so, extract the value and compare again. I was thinking of chaining multiple `withColumn()` calls with a `when()` function.
For example, my two dataframes could look something like this:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
The result I wish to obtain is to first check whether the value of `df1.id` exists in `df2.id` and, if it does, return the corresponding `df2.value`. For example, I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So that I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
Then I can do another check between these two value columns in the `df1` dataframe and return a boolean column (`1` or `0`) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join `df1` with `df2` on `id`, after prefixing all `df2` columns except `id` with `df2_`:
from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark`.
df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])

# Rename every df2 column except the join key to df2_<name>, then left join on id.
df = df1.join(
    df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
    ["id"],
    "left"
)
Then, using `functools.reduce`, you can construct a boolean expression that checks whether the columns match in the two dataframes, like this:
from functools import reduce

check_expr = reduce(
    lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
    [c for c in df1.columns if c != 'id'],
    F.lit(True)
)

df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#|   id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111|     1111|    1|
#|world| 2222|     3333|    0|
#+-----+-----+---------+-----+