
How to quickly check if row exists in PySpark Dataframe?

I have a PySpark dataframe like this:

+------+------+
|     A|     B|
+------+------+
|     1|     2|
|     1|     3|
|     2|     3|
|     2|     5|
+------+------+

I want to do a lookup on the table to see if a specific row exists. For example, for the test A = 2, B = 5 the code should return True, and for A = 2, B = 10 the code should return False.

I tried this:

df[(df['A'] == 1) & (df['B'] == 2)].rdd.isEmpty()

Unfortunately, this code takes a long time to execute, and since this is a lookup that will be performed many times (for different values of A and B), I would like to have a quicker method of accomplishing this task.

Other solutions that I am considering are:

  • Converting the PySpark dataframe to a Pandas dataframe, because row lookups are faster there
  • Using .where() or .filter(), though from what I have tried, I do not anticipate either being substantially faster
  • Using .count() instead of isEmpty()

It would be better to create a Spark dataframe from the entries that you want to look up, and then do a semi join or an anti join against the original dataframe to get the rows that do or do not exist in it. This should be more efficient than checking the entries one by one.

import pyspark.sql.functions as F

# 'lookup' is the original 4-row dataframe from the question;
# 'df' holds the (A, B) pairs to test
df = spark.createDataFrame([[2, 5], [2, 10]], ['A', 'B'])

# semi join keeps probe rows that have a match; anti join keeps those that don't
result1 = df.join(lookup, ['A', 'B'], 'semi').withColumn('exists', F.lit(True))
result2 = df.join(lookup, ['A', 'B'], 'anti').withColumn('exists', F.lit(False))

result = result1.unionAll(result2)

result.show()
+---+---+------+
|  A|  B|exists|
+---+---+------+
|  2|  5|  true|
|  2| 10| false|
+---+---+------+

The Spark SQL aggregate function ANY offers a very quick way to check if a record exists inside a dataframe.

check = df.selectExpr('ANY((A = 2) AND (B = 5)) as chk')

check.show()
#  +----+
#  | chk|
#  +----+
#  |true|
#  +----+

check = df.selectExpr('ANY((A = 2) AND (B = 10)) as chk')
check.show()
#  +-----+
#  |  chk|
#  +-----+
#  |false|
#  +-----+

