I have a PySpark dataframe like this:
+------+------+
| A| B|
+------+------+
| 1| 2|
| 1| 3|
| 2| 3|
| 2| 5|
+------+------+
I want to do a lookup on the table to see if a specific row exists. For example, for the test of A = 2, B = 5 the code should return True, and for A = 2, B = 10 the code should return False.
I tried this:
df[(df['A'] == 1) & (df['B'] == 2)].rdd.isEmpty()
Unfortunately, this code takes a long time to execute, and since this is a lookup that will be performed many times (for different values of A and B), I would like to have a quicker method of accomplishing this task.
Other solutions that I am considering are:
- .where() or .filter(), though from what I have tried, I do not anticipate either being substantially faster
- .count() over isEmpty()
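In plain Python terms (a sketch over the collected rows as tuples, not the Spark code itself; the helper name row_exists is mine), the behaviour I want is a tuple-membership test:

```python
# The rows of the dataframe, collected as plain (A, B) tuples
rows = [(1, 2), (1, 3), (2, 3), (2, 5)]
row_set = set(rows)

def row_exists(a, b):
    # Constant-time membership test against the collected rows
    return (a, b) in row_set

print(row_exists(2, 5))   # True
print(row_exists(2, 10))  # False
```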
It would be better to create a Spark dataframe from the entries that you want to look up, and then do a semi join or an anti join to get the rows that do or do not exist in the lookup dataframe. This should be more efficient than checking the entries one by one.
import pyspark.sql.functions as F

# `lookup` is the original dataframe from the question
lookup = spark.createDataFrame([[1, 2], [1, 3], [2, 3], [2, 5]], ['A', 'B'])

# The (A, B) pairs to test
df = spark.createDataFrame([[2, 5], [2, 10]], ['A', 'B'])

# A semi join keeps the rows of df that have a match in lookup;
# an anti join keeps the rows of df that have no match in lookup
result1 = df.join(lookup, ['A', 'B'], 'semi').withColumn('exists', F.lit(True))
result2 = df.join(lookup, ['A', 'B'], 'anti').withColumn('exists', F.lit(False))
result = result1.unionAll(result2)
result.show()
+---+---+------+
| A| B|exists|
+---+---+------+
| 2| 5| true|
| 2| 10| false|
+---+---+------+
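Since the lookups will be repeated many times, the joined result can be collected once into a plain Python dict keyed by (A, B), so each subsequent check is O(1). A hedged sketch, where the `collected` list stands in for `result.collect()` from the join above:

```python
# Stand-in for result.collect() from the semi/anti join result
collected = [{'A': 2, 'B': 5, 'exists': True},
             {'A': 2, 'B': 10, 'exists': False}]

# Map each (A, B) pair to its exists flag for constant-time repeated lookups
exists_map = {(r['A'], r['B']): r['exists'] for r in collected}

print(exists_map[(2, 5)])   # True
print(exists_map[(2, 10)])  # False
```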
The Spark SQL aggregate function ANY offers a very quick way to check whether a record exists inside a dataframe.
check = df.selectExpr('ANY((A = 2) AND (B = 5)) as chk')
check.show()
# +----+
# | chk|
# +----+
# |true|
# +----+
check = df.selectExpr('ANY((A = 2) AND (B = 10)) as chk')
check.show()
# +-----+
# | chk|
# +-----+
# |false|
# +-----+
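To see what ANY computes here, the following is a plain-Python sketch of the same aggregation over the collected rows (the helper any_match is my illustration, not a Spark API):

```python
# The rows of the dataframe from the question, as plain (A, B) tuples
rows = [(1, 2), (1, 3), (2, 3), (2, 5)]

def any_match(rows, a, b):
    # ANY((A = a) AND (B = b)) ORs the predicate across all rows:
    # True as soon as one row satisfies it, False otherwise
    return any(ra == a and rb == b for ra, rb in rows)

print(any_match(rows, 2, 5))   # True
print(any_match(rows, 2, 10))  # False
```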