
How to compare row value with previous row value?

I have the following PySpark dataframe.

ID
7773
7773
7372
7372
2032
2032
2032

I need to compare each row's value with the next row's value. If I were doing this in pandas, I would use the following logic:

for i in range(df.shape[0] - 1):
    if df.iloc[i]['ID'] == df.iloc[i + 1]['ID']:
        print(True)
    else:
        print(False)
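As an aside, the row-by-row loop above can be vectorized in pandas with `shift`, which avoids explicit iteration (a sketch using the same sample data):

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'ID': ['7773', '7773', '7372', '7372', '2032', '2032', '2032']})

# shift() moves each value down one row, so comparing the column with its
# shifted copy checks each row against the previous one; the first row has
# no predecessor and therefore compares as False.
same_as_previous = df['ID'].eq(df['ID'].shift())
print(same_as_previous.tolist())  # [False, True, False, True, False, True, True]
```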

How can I do the same in PySpark dataframe or SQL?

Spark is not really designed to work with rows one by one. It is inefficient, because the data is normally split across many nodes, so row-by-row processing forces Spark to move all the data to a single node. That said, you can use the lag window function over the full table, providing just an orderBy condition (note that a window without partitionBy also collects all rows into a single partition, which Spark will warn about).

from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame([('7773',), ('7773',), ('7372',), ('7372',), ('2032',), ('2032',), ('2032',)], ['ID'])

# Window over the whole table (no partitionBy); lag('ID') returns the
# previous row's ID in this ordering, or null for the first row.
w = W.orderBy(F.desc('ID'))
df = df.withColumn('equals_earlier', F.col('ID') == F.lag('ID').over(w))

df.show()
# +----+--------------+
# |  ID|equals_earlier|
# +----+--------------+
# |7773|          null|
# |7773|          true|
# |7372|         false|
# |7372|          true|
# |2032|         false|
# |2032|          true|
# |2032|          true|
# +----+--------------+
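Since the question also asks about SQL: the same logic can be written in Spark SQL, assuming the DataFrame has been registered as a temporary view (the view name `t` below is chosen here for illustration):

```sql
-- LAG(ID) OVER (...) is the previous row's ID in the window order;
-- for the first row LAG returns NULL, so the comparison yields NULL.
SELECT ID,
       ID = LAG(ID) OVER (ORDER BY ID DESC) AS equals_earlier
FROM t
```

You would register the view with `df.createOrReplaceTempView('t')` and run the query via `spark.sql(...)`.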
