
PySpark filter DataFrame where values in a column do not exist in another DataFrame column

I don't understand why this isn't working in PySpark...

I'm trying to split the data into an approved DataFrame and a rejected DataFrame based on column values. So rejected looks at the language column values in approved and only returns rows where the language does not exist in the approved DataFrame's language column:

# Data
columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]

df = spark.createDataFrame(data, columns)
df.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# |    Java|      20000|
# |  Python|     100000|
# |   Scala|       3000|
# |     C++|      10000|
# |      C#|   32195432|
# |       C|     238135|
# |       R|     134315|
# |    Ruby|        235|
# |       C|       1000|
# |       R|       2000|
# |    Ruby|       4000|
# +--------+-----------+

# Approved
is_approved = df.users_count > 10000
df_approved = df.filter(is_approved)
df_approved.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# |    Java|      20000|
# |  Python|     100000|
# |      C#|   32195432|
# |       C|     238135|
# |       R|     134315|
# +--------+-----------+

# Rejected
is_not_approved = ~df.language.isin(df_approved.language)
df_rejected = df.filter(is_not_approved)
df_rejected.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# +--------+-----------+

# Also tried
df.filter( ~df.language.contains(df_approved.language) ).show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# +--------+-----------+

So that doesn't make any sense - why is df_rejected empty?

Expected outcomes using other approaches:

SQL:

SELECT * FROM df
WHERE language NOT IN ( SELECT language FROM df_approved )

Python:

data_approved = []
for language, users_count in data:
    if users_count > 10000:
        data_approved.append((language, users_count))

data_rejected = []
for language, users_count in data:
    if language not in [row[0] for row in data_approved]:
        data_rejected.append((language, users_count))

print(data_approved)
print(data_rejected)
# [('Java', 20000), ('Python', 100000), ('C#', 32195432), ('C', 238135), ('R', 134315)]
# [('Scala', 3000), ('C++', 10000), ('Ruby', 235), ('Ruby', 4000)]

Why is PySpark not filtering as expected?

First of all, you will want to use a window to select the maximum users_count per language.

from pyspark.sql import Window, functions

columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]
df = spark.createDataFrame(data, columns)

# Window partitioned by language, so max('users_count').over(w) yields
# each language's maximum users_count
w = Window.partitionBy('language')

df = (df.withColumn('max_users_count',
                    functions.max('users_count').over(w))
        .where(functions.col('users_count') == functions.col('max_users_count'))
        .drop('max_users_count'))
df.show()
df.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
|      C#|   32195432|
|     C++|      10000|
|       C|     238135|
|       R|     134315|
|   Scala|       3000|
|    Ruby|       4000|
|  Python|     100000|
|    Java|      20000|
+--------+-----------+

Then you can filter based on the specified condition.

is_approved = df.users_count > 10000
df_approved = df.filter(is_approved)
df_approved.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|      C#|   32195432|
|       C|     238135|
|       R|     134315|
+--------+-----------+

And then, for the reverse of the condition, add the ~ operator in the filter statement:

is_not_approved = df.filter(~is_approved)
is_not_approved.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
|   Scala|       3000|
|     C++|      10000|
|    Ruby|        235|
|       C|       1000|
|       R|       2000|
|    Ruby|       4000|
+--------+-----------+

Try:

df.subtract(df_approved).show()
                                                                                    
+--------+-----------+
|language|users_count|
+--------+-----------+
|       R|       2000|
|    Ruby|       4000|
|   Scala|       3000|
|       C|       1000|
|     C++|      10000|
|    Ruby|        235|
+--------+-----------+

Went the SQL route:

columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]

df = spark.createDataFrame(data, columns)
df_approved = df.filter(df.users_count > 10000)

df.createOrReplaceTempView("df")
df_approved.createOrReplaceTempView("df_approved")

df_not_approved = spark.sql("""
    SELECT * FROM df WHERE NOT EXISTS (
        SELECT 1 FROM df_approved
        WHERE df.language = df_approved.language
        )
""")

df_not_approved.show()

# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# |     C++|      10000|
# |    Ruby|        235|
# |    Ruby|       4000|
# |   Scala|       3000|
# +--------+-----------+
