
PySpark filter DataFrame where values in a column do not exist in another DataFrame column

I don't understand why this isn't working in PySpark...

I'm trying to split the data into an approved DataFrame and a rejected DataFrame based on column values. So rejected looks at the language column values in approved and only returns rows where the language does not exist in the approved DataFrame's language column:

# Data
columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]

df = spark.createDataFrame(data, columns)
df.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# |    Java|      20000|
# |  Python|     100000|
# |   Scala|       3000|
# |     C++|      10000|
# |      C#|   32195432|
# |       C|     238135|
# |       R|     134315|
# |    Ruby|        235|
# |       C|       1000|
# |       R|       2000|
# |    Ruby|       4000|
# +--------+-----------+

# Approved
is_approved = df.users_count > 10000
df_approved = df.filter(is_approved)
df_approved.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# |    Java|      20000|
# |  Python|     100000|
# |      C#|   32195432|
# |       C|     238135|
# |       R|     134315|
# +--------+-----------+

# Rejected
is_not_approved = ~df.language.isin(df_approved.language)
df_rejected = df.filter(is_not_approved)
df_rejected.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# +--------+-----------+

# Also tried
df.filter( ~df.language.contains(df_approved.language) ).show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# +--------+-----------+

So that doesn't make any sense - why is df_rejected empty?

Expected outcomes using other approaches:

SQL:

SELECT * FROM df
WHERE language NOT IN ( SELECT language FROM df_approved )

Python:

data_approved = []
for language, users_count in data:
    if users_count > 10000:
        data_approved.append((language, users_count))

data_rejected = []
for language, users_count in data:
    if language not in [row[0] for row in data_approved]:
        data_rejected.append((language, users_count))

print(data_approved)
print(data_rejected)
# [('Java', 20000), ('Python', 100000), ('C#', 32195432), ('C', 238135), ('R', 134315)]
# [('Scala', 3000), ('C++', 10000), ('Ruby', 235), ('Ruby', 4000)]

Why is PySpark not filtering as expected?

First of all, you will want to use a window to select the maximum users_count per language.

from pyspark.sql import Window, functions

columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]
df = spark.createDataFrame(data, columns)

# Window partitioned by language, so max('users_count').over(w) yields
# each language's maximum users_count
w = Window.partitionBy('language')

df = (df.withColumn('max_users_count',
                    functions.max('users_count').over(w))
        .where(functions.col('users_count') == functions.col('max_users_count'))
        .drop('max_users_count'))
df.show()
df.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
|      C#|   32195432|
|     C++|      10000|
|       C|     238135|
|       R|     134315|
|   Scala|       3000|
|    Ruby|       4000|
|  Python|     100000|
|    Java|      20000|
+--------+-----------+

Then you can filter based on the specified condition.

is_approved = df.users_count > 10000
df_approved = df.filter(is_approved)
df_approved.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|      C#|   32195432|
|       C|     238135|
|       R|     134315|
+--------+-----------+

And then, for the reverse of the condition, add the ~ operator in the filter statement:

is_not_approved = df.filter(~is_approved)
is_not_approved.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
|   Scala|       3000|
|     C++|      10000|
|    Ruby|        235|
|       C|       1000|
|       R|       2000|
|    Ruby|       4000|
+--------+-----------+

Try:

df.subtract(df_approved).show()
                                                                                    
+--------+-----------+
|language|users_count|
+--------+-----------+
|       R|       2000|
|    Ruby|       4000|
|   Scala|       3000|
|       C|       1000|
|     C++|      10000|
|    Ruby|        235|
+--------+-----------+

Went the SQL route:

columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]

df = spark.createDataFrame(data, columns)
df_approved = df.filter(df.users_count > 10000)

df.createOrReplaceTempView("df")
df_approved.createOrReplaceTempView("df_approved")

df_not_approved = spark.sql("""
    SELECT * FROM df WHERE NOT EXISTS (
        SELECT 1 FROM df_approved
        WHERE df.language = df_approved.language
        )
""")

df_not_approved.show()

# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# |     C++|      10000|
# |    Ruby|        235|
# |    Ruby|       4000|
# |   Scala|       3000|
# +--------+-----------+
