How to quickly check if a row exists in a PySpark DataFrame?
I have a PySpark dataframe like this:
+------+------+
| A| B|
+------+------+
| 1| 2|
| 1| 3|
| 2| 3|
| 2| 5|
+------+------+
I want to do a lookup on the table to see if a specific row exists. For example, for the test of A = 2, B = 5 the code should return True, and for A = 2, B = 10 the code should return False.
I tried this:
df[(df['A'] == 1) & (df['B'] == 2)].rdd.isEmpty()
Unfortunately, this code takes a long time to execute, and since this is a lookup that will be performed many times (for different values of A and B), I would like to have a quicker method of accomplishing this task.
Other solutions that I am considering are:
- .where() or .filter(), though from what I have tried, I do not anticipate either being substantially faster
- .count() over isEmpty()
It would be better to create a Spark dataframe from the entries that you want to look up, and then do a semi join or an anti join against the original dataframe to get the entries that do or do not exist in it. This should be more efficient than checking the entries one by one.
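import pyspark.sql.functions as F
# `df` is the original dataframe from the question; `queries` holds the rows to look up.
queries = spark.createDataFrame([[2, 5], [2, 10]], ['A', 'B'])
# The semi join keeps query rows that have a match in df; the anti join keeps those that do not.
result1 = queries.join(df, ['A', 'B'], 'semi').withColumn('exists', F.lit(True))
result2 = queries.join(df, ['A', 'B'], 'anti').withColumn('exists', F.lit(False))
result = result1.unionAll(result2)
result.show()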
+---+---+------+
| A| B|exists|
+---+---+------+
| 2| 5| true|
| 2| 10| false|
+---+---+------+
The Spark SQL aggregate function ANY (available since Spark 3.0) offers a quick way to check if a record exists inside a dataframe.
check = df.selectExpr('ANY((A = 2) AND (B = 5)) as chk')
check.show()
# +----+
# | chk|
# +----+
# |true|
# +----+
check = df.selectExpr('ANY((A = 2) AND (B = 10)) as chk')
check.show()
# +-----+
# | chk|
# +-----+
# |false|
# +-----+