
Check if PySpark column values exist in another dataframe's column values

I'm trying to figure out the condition to check whether the values of one PySpark dataframe exist in another PySpark dataframe, and if so, extract the value and compare again. I was thinking of doing multiple withColumn() calls with a when() function.

For example, my two dataframes could look like:

df1
| id    | value |
| ----- | ----  |
| hello | 1111  |
| world | 2222  |

df2
| id     | value |
| ------ | ----  |
| hello  | 1111  |
| world  | 3333  |
| people | 2222  |

The result I wish to obtain is to first check whether the value of df1.id exists in df2.id and, if it does, return the corresponding df2.value. For example, I was trying something like:

df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))

So I get something like:

df1
| id    | value | df2_value |
| ----- | ----  | --------- |
| hello | 1111  | 1111      |
| world | 2222  | 3333      |

So that now I can do another check between these two value columns in the df1 dataframe, and return a boolean column (1 or 0) in a new dataframe.

The result I wish to get would be something like:

df3
| id    | value | df2_value | match |
| ----- | ----  | --------- | ----- |
| hello | 1111  | 1111      | 1     |
| world | 2222  | 3333      | 0     |

Left join df1 with df2 on id, after prefixing all df2 columns except id with df2_*:

from pyspark.sql import functions as F

df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])

df = df1.join(
    df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
    ["id"],
    "left"
)

Then, using functools.reduce, you can construct a boolean expression that checks whether the columns match across the two dataframes, like this:

from functools import reduce

check_expr = reduce(
    lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
    [c for c in df1.columns if c != 'id'],
    F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#|   id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111|     1111|    1|
#|world| 2222|     3333|    0|
#+-----+-----+---------+-----+
