How to quickly check if row exists in PySpark Dataframe?

I have a PySpark dataframe like this:

+------+------+
|     A|     B|
+------+------+
|     1|     2|
|     1|     3|
|     2|     3|
|     2|     5|
+------+------+

I want to do a lookup on the table to see if a specific row exists. For example, for the test of A = 2, B = 5 the code should return True, and for A = 2, B = 10 it should return False.

I tried this (isEmpty() returns True when no row matches, so the result is negated):

not df[(df['A'] == 1) & (df['B'] == 2)].rdd.isEmpty()

Unfortunately, this code takes a long time to execute, and since this is a lookup that will be performed many times (for different values of A and B), I would like to have a quicker method of accomplishing this task.

Other solutions that I am considering are:

  • Converting the PySpark dataframe to a Pandas dataframe because the row lookups are faster
  • Using .where() or .filter() though from what I have tried, I do not anticipate either being substantially faster
  • Using .count() over isEmpty()
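The first option (moving the lookups out of Spark) can also be sketched without pandas. This is only a rough illustration, and it assumes the distinct (A, B) pairs are few enough to fit in driver memory. The helper name build_pair_index is hypothetical, and plain tuples stand in for the rows that a collect() call would return.

```python
def build_pair_index(rows):
    """Return a set of (A, B) tuples so each lookup is a constant-time membership test."""
    return {(row[0], row[1]) for row in rows}


# With Spark, the rows would come from something like:
#   rows = df.select('A', 'B').distinct().collect()
# Plain tuples stand in for the collected rows here (same data as the table above).
rows = [(1, 2), (1, 3), (2, 3), (2, 5)]
index = build_pair_index(rows)

print((2, 5) in index)   # True
print((2, 10) in index)  # False
```

The dataframe is scanned once to build the set, and every later lookup stays on the driver. If the table is too large to collect, this approach does not apply.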

It would be better to create a Spark dataframe from the entries that you want to look up, and then do a semi join or an anti join against the original dataframe to get the entries that do or do not exist in it. This should be more efficient than checking the entries one by one.

import pyspark.sql.functions as F

# df is the original dataframe from the question; lookup holds the (A, B) pairs to check
lookup = spark.createDataFrame([[2,5],[2,10]],['A','B'])

# semi join keeps the entries of lookup that have a match in df
result1 = lookup.join(df, ['A','B'], 'semi').withColumn('exists', F.lit(True))

# anti join keeps the entries of lookup that have no match in df
result2 = lookup.join(df, ['A','B'], 'anti').withColumn('exists', F.lit(False))

result = result1.unionAll(result2)

result.show()
+---+---+------+
|  A|  B|exists|
+---+---+------+
|  2|  5|  true|
|  2| 10| false|
+---+---+------+

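The Spark SQL aggregate function ANY (available since Spark 3.0) offers a quick way to check whether a record exists in a dataframe: it returns true if the condition holds for at least one row.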

check = df.selectExpr('ANY((A = 2) AND (B = 5)) as chk')

check.show()
#  +----+
#  | chk|
#  +----+
#  |true|
#  +----+

check = df.selectExpr('ANY((A = 2) AND (B = 10)) as chk')
check.show()
#  +-----+
#  |  chk|
#  +-----+
#  |false|
#  +-----+
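Since the lookup is performed many times, several ANY checks can be put into a single selectExpr call so that the dataframe is aggregated once for all of them. The following is a minimal sketch under that assumption, not a definitive implementation. The helper names any_expr and rows_exist and the chk_0, chk_1, ... aliases are hypothetical, and only plain numeric values are supported to avoid SQL quoting issues.

```python
def any_expr(values, alias):
    """Build a string such as 'ANY((A = 2) AND (B = 5)) AS chk' from a {column: number} dict."""
    parts = []
    for column, value in values.items():
        # Only plain numbers are interpolated, to keep this sketch free of quoting issues.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"unsupported value for {column}: {value!r}")
        parts.append(f"({column} = {value})")
    return f"ANY({' AND '.join(parts)}) AS {alias}"


def rows_exist(df, candidates):
    """Check several candidate rows with one aggregation over df; returns a list of bools."""
    if not candidates:
        return []
    exprs = [any_expr(values, f"chk_{i}") for i, values in enumerate(candidates)]
    row = df.selectExpr(*exprs).first()
    # ANY yields NULL (None in Python) on an empty dataframe; bool() maps that to False.
    return [bool(row[f"chk_{i}"]) for i in range(len(candidates))]
```

For the dataframe in the question, rows_exist(df, [{'A': 2, 'B': 5}, {'A': 2, 'B': 10}]) should give [True, False], matching the two check.show() outputs above.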
