I have a PySpark dataframe like this:
+------+------+
| A| B|
+------+------+
| 1| 2|
| 1| 3|
| 2| 3|
| 2| 5|
+------+------+
I want to do a lookup on the table to see if a specific row exists. For example, for the test of A = 2, B = 5 the code should return True, and for A = 2, B = 10 the code should return False.
I tried this:
df[(df['A'] == 1) & (df['B'] == 2)].rdd.isEmpty()
Unfortunately, this code takes a long time to execute, and since this is a lookup that will be performed many times (for different values of A and B), I would like to have a quicker method of accomplishing this task.
Other solutions that I am considering are:
- .where() or .filter(), though from what I have tried, I do not anticipate either being substantially faster
- .count() over isEmpty()
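In plain Python terms (a sketch over the collected rows as tuples, not the Spark code itself; the helper name row_exists is mine), the behaviour I want is a tuple-membership test:

```python
# The rows of the dataframe, collected as plain (A, B) tuples
rows = [(1, 2), (1, 3), (2, 3), (2, 5)]
row_set = set(rows)

def row_exists(a, b):
    # Constant-time membership test against the collected rows
    return (a, b) in row_set

print(row_exists(2, 5))   # True
print(row_exists(2, 10))  # False
```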
It would be better to create a Spark dataframe from the entries that you want to look up, and then do a semi join or an anti join to get the rows that do or do not exist in the lookup dataframe. This should be more efficient than checking the entries one by one.
import pyspark.sql.functions as F

# `lookup` is the original dataframe from the question
lookup = spark.createDataFrame([[1, 2], [1, 3], [2, 3], [2, 5]], ['A', 'B'])

# The (A, B) pairs to test
df = spark.createDataFrame([[2, 5], [2, 10]], ['A', 'B'])

# A semi join keeps the rows of df that have a match in lookup;
# an anti join keeps the rows of df that have no match in lookup
result1 = df.join(lookup, ['A', 'B'], 'semi').withColumn('exists', F.lit(True))
result2 = df.join(lookup, ['A', 'B'], 'anti').withColumn('exists', F.lit(False))
result = result1.unionAll(result2)
result.show()
+---+---+------+
| A| B|exists|
+---+---+------+
| 2| 5| true|
| 2| 10| false|
+---+---+------+
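Since the lookups will be repeated many times, the joined result can be collected once into a plain Python dict keyed by (A, B), so each subsequent check is O(1). A hedged sketch, where the `collected` list stands in for `result.collect()` from the join above:

```python
# Stand-in for result.collect() from the semi/anti join result
collected = [{'A': 2, 'B': 5, 'exists': True},
             {'A': 2, 'B': 10, 'exists': False}]

# Map each (A, B) pair to its exists flag for constant-time repeated lookups
exists_map = {(r['A'], r['B']): r['exists'] for r in collected}

print(exists_map[(2, 5)])   # True
print(exists_map[(2, 10)])  # False
```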
The Spark SQL aggregate function ANY offers a very quick way to check whether a record exists inside a dataframe.
check = df.selectExpr('ANY((A = 2) AND (B = 5)) as chk')
check.show()
# +----+
# | chk|
# +----+
# |true|
# +----+
check = df.selectExpr('ANY((A = 2) AND (B = 10)) as chk')
check.show()
# +-----+
# | chk|
# +-----+
# |false|
# +-----+
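To see what ANY computes here, the following is a plain-Python sketch of the same aggregation over the collected rows (the helper any_match is my illustration, not a Spark API):

```python
# The rows of the dataframe from the question, as plain (A, B) tuples
rows = [(1, 2), (1, 3), (2, 3), (2, 5)]

def any_match(rows, a, b):
    # ANY((A = a) AND (B = b)) ORs the predicate across all rows:
    # True as soon as one row satisfies it, False otherwise
    return any(ra == a and rb == b for ra, rb in rows)

print(any_match(rows, 2, 5))   # True
print(any_match(rows, 2, 10))  # False
```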