
How to compare row value with previous row value?

I have the following PySpark dataframe.

ID
7773
7773
7372
7372
2032
2032
2032

I need to compare each row's value with the next row's value. If I were doing this in pandas, I would use the following logic:

for i in range(df.shape[0] - 1):
    if df.iloc[i]['ID'] == df.iloc[i + 1]['ID']:
        print(True)
    else:
        print(False)
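As an aside, the row-by-row loop above can be vectorized in pandas with `shift`, which avoids explicit iteration (a sketch using the same sample data):

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'ID': ['7773', '7773', '7372', '7372', '2032', '2032', '2032']})

# shift() moves each value down one row, so comparing the column with its
# shifted copy checks each row against the previous one; the first row has
# no predecessor and therefore compares as False.
same_as_previous = df['ID'].eq(df['ID'].shift())
print(same_as_previous.tolist())  # [False, True, False, True, False, True, True]
```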

How can I do the same in PySpark dataframe or SQL?

Spark is not really designed to work with rows one by one. It is inefficient, because the data is normally split across many nodes, so row-by-row processing forces Spark to move all the data to a single node. That said, you can use the lag window function over the full table, providing just an orderBy condition (note that a window without partitionBy also collects all rows into a single partition, which Spark will warn about).

from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame([('7773',), ('7773',), ('7372',), ('7372',), ('2032',), ('2032',), ('2032',)], ['ID'])

# Window over the whole table (no partitionBy); lag('ID') returns the
# previous row's ID in this ordering, or null for the first row.
w = W.orderBy(F.desc('ID'))
df = df.withColumn('equals_earlier', F.col('ID') == F.lag('ID').over(w))

df.show()
# +----+--------------+
# |  ID|equals_earlier|
# +----+--------------+
# |7773|          null|
# |7773|          true|
# |7372|         false|
# |7372|          true|
# |2032|         false|
# |2032|          true|
# |2032|          true|
# +----+--------------+
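Since the question also asks about SQL: the same logic can be written in Spark SQL, assuming the DataFrame has been registered as a temporary view (the view name `t` below is chosen here for illustration):

```sql
-- LAG(ID) OVER (...) is the previous row's ID in the window order;
-- for the first row LAG returns NULL, so the comparison yields NULL.
SELECT ID,
       ID = LAG(ID) OVER (ORDER BY ID DESC) AS equals_earlier
FROM t
```

You would register the view with `df.createOrReplaceTempView('t')` and run the query via `spark.sql(...)`.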
