Pyspark DataFrame Filter column based on a column in another DataFrame without join
PySpark filter DataFrame where values in a column do not exist in another DataFrame column
I don't understand why this doesn't work in PySpark...
I'm trying to split the data into an approved DataFrame and a rejected DataFrame based on a column value. The rejected DataFrame should look at the language values in the approved DataFrame and return only those rows whose language does not appear in the approved DataFrame's language column:
# Data
columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]
df = spark.createDataFrame(data, columns)
df.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# | Java| 20000|
# | Python| 100000|
# | Scala| 3000|
# | C++| 10000|
# | C#| 32195432|
# | C| 238135|
# | R| 134315|
# | Ruby| 235|
# | C| 1000|
# | R| 2000|
# | Ruby| 4000|
# +--------+-----------+
# Approved
is_approved = df.users_count > 10000
df_approved = df.filter(is_approved)
df_approved.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# | Java| 20000|
# | Python| 100000|
# | C#| 32195432|
# | C| 238135|
# | R| 134315|
# +--------+-----------+
# Rejected
is_not_approved = ~df.language.isin(df_approved.language)
df_rejected = df.filter(is_not_approved)
df_rejected.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# +--------+-----------+
# Also tried
df.filter( ~df.language.contains(df_approved.language) ).show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# +--------+-----------+
So this makes no sense: why is df_rejected empty?
Expected result using other approaches:
SQL:
SELECT * FROM df
WHERE language NOT IN ( SELECT language FROM df_approved )
Python:
data_approved = []
for language, users_count in data:
    if users_count > 10000:
        data_approved.append((language, users_count))

data_rejected = []
for language, users_count in data:
    if language not in [row[0] for row in data_approved]:
        data_rejected.append((language, users_count))
print(data_approved)
print(data_rejected)
# [('Java', 20000), ('Python', 100000), ('C#', 32195432), ('C', 238135), ('R', 134315)]
# [('Scala', 3000), ('C++', 10000), ('Ruby', 235), ('Ruby', 4000)]
Why isn't PySpark filtering as expected?
First, you need to use a window to select the rows with the maximum users_count per language.
from pyspark.sql import Window, functions

columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]
df = spark.createDataFrame(data, columns)

# Window partitioned by language, so max('users_count') is computed per language
w = Window.partitionBy('language')

df = (df.withColumn('max_users_count',
                    functions.max('users_count').over(w))
        .where(functions.col('users_count') == functions.col('max_users_count'))
        .drop('max_users_count'))
df.show()
df.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| C#| 32195432|
| C++| 10000|
| C| 238135|
| R| 134315|
| Scala| 3000|
| Ruby| 4000|
| Python| 100000|
| Java| 20000|
+--------+-----------+
Then you can filter on the given condition.
is_approved = df.users_count > 10000
df_approved = df.filter(is_approved)
df_approved.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| C#| 32195432|
| C| 238135|
| R| 134315|
+--------+-----------+
Then, to invert the condition, add the ~ operator in the filter statement:
is_not_approved = df.filter(~is_approved)
is_not_approved.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| Scala| 3000|
| C++| 10000|
| Ruby| 235|
| C| 1000|
| R| 2000|
| Ruby| 4000|
+--------+-----------+
Try:
df.subtract(df_approved).show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| R| 2000|
| Ruby| 4000|
| Scala| 3000|
| C| 1000|
| C++| 10000|
| Ruby| 235|
+--------+-----------+
Going the SQL route:
columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]
df = spark.createDataFrame(data, columns)
df_approved = df.filter(df.users_count > 10000)
df.createOrReplaceTempView("df")
df_approved.createOrReplaceTempView("df_approved")
df_not_approved = spark.sql("""
    SELECT * FROM df WHERE NOT EXISTS (
        SELECT 1 FROM df_approved
        WHERE df.language = df_approved.language
    )
""")
df_not_approved.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# | C++| 10000|
# | Ruby| 235|
# | Ruby| 4000|
# | Scala| 3000|
# +--------+-----------+
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite this site or the original source. For any questions, contact: yoyou2525@163.com.