How to assign ranks to records in a spark dataframe based on some conditions?

Question

Given a DataFrame:

+-------+-------+
|   A   |   B   |
+-------+-------+
|      a|      1|
|      b|      2|
|      c|      5|
|      d|      7|
|      e|     11|
+-------+-------+

I want to assign ranks to the records based on these conditions:

Start the rank at 1.
Assign rank = rank of the previous record if (B of current record - B of previous record) is <= 2.
Increment the rank when (B of current record - B of previous record) is > 2.
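
To pin down the rule, here is a minimal sequential sketch in plain Python. It is only a single-machine reference for the expected ranks, not a Spark solution; the function name assign_ranks and the max_gap parameter are made up for illustration, and the input is assumed to be already sorted by B.

```python
def assign_ranks(b_values, max_gap=2):
    """Sequential reference for the ranking rule.

    Assumes b_values is already sorted in the desired order.
    The rank starts at 1 and is incremented whenever the gap to the
    previous value is greater than max_gap.
    """
    ranks = []
    rank = 1
    prev = None
    for b in b_values:
        # The first record has no previous record, so it keeps rank 1.
        if prev is not None and b - prev > max_gap:
            rank += 1
        ranks.append(rank)
        prev = b
    return ranks


print(assign_ranks([1, 2, 5, 7, 11]))  # [1, 1, 2, 2, 3]
```

Put differently, the rank of a record is 1 plus the number of gaps larger than 2 seen up to and including that record.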

So I want the result to look like this:

+-------+-------+------+
|   A   |   B   | rank |
+-------+-------+------+
|      a|      1|     1|
|      b|      2|     1|
|      c|      5|     2|
|      d|      7|     2|
|      e|     11|     3|
+-------+-------+------+

Built-in Spark functions such as rowNumber, rank, and dense_rank cannot do this on their own.
I tried keeping a global rank variable and fetching the previous record's values with the lag function, but because Spark processes the data in a distributed fashion, unlike SQL, the results are not consistent.
Another approach I tried was passing the lag values of a record to a UDF while generating the new column and applying the conditions inside the UDF. The problem is that I can get lag values for columns A and B but not for the rank column. The following query fails because the column name rank cannot be resolved:
HiveContext.sql("SELECT df.*,LAG(df.rank, 1) OVER (ORDER BY B , 0) AS rank_lag, udfGetVisitNo(B,rank_lag) as rank FROM df")
I cannot take the lag of a column that I am in the process of adding.
I also don't want methods that require df.collect(), because this DataFrame is quite large and collecting it on a single node causes memory errors.

Is there any other way to achieve this? I would like a solution with time complexity O(n), where n is the number of records.

Answer 1

A SQL solution would be:

select a, b, 1 + sum(col) over (order by a) as rnk
from
(
    select t.*,
           case when b - lag(b, 1, b) over (order by a) <= 2 then 0 else 1 end as col
    from t
) x

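The inner query flags each record with 1 when its gap to the previous record is greater than 2 and with 0 otherwise; the running sum of these flags, plus 1, is the rank. The solution assumes the ordering is based on column a. If the records should be ordered by b instead, change both order by clauses accordingly.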
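
As a quick sanity check, the query can be run against the sample data from the question. The sketch below uses Python's built-in sqlite3 module as a stand-in engine, which is an assumption for illustration only (it is not Spark, and it needs an SQLite build with window function support, 3.25 or later). The table name t matches the query above; the helper name run_rank_query and the final order by a, added to make the output order deterministic, are additions.

```python
import sqlite3

# The query from the answer, plus a final ORDER BY so rows come back in a fixed order.
QUERY = """
SELECT a, b, 1 + SUM(col) OVER (ORDER BY a) AS rnk
FROM (
    SELECT t.*,
           CASE WHEN b - LAG(b, 1, b) OVER (ORDER BY a) <= 2 THEN 0 ELSE 1 END AS col
    FROM t
) x
ORDER BY a
"""


def run_rank_query(rows):
    """Load (a, b) rows into an in-memory table t and run the ranking query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (a TEXT, b INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


sample = [("a", 1), ("b", 2), ("c", 5), ("d", 7), ("e", 11)]
for row in run_rank_query(sample):
    print(row)
```

On the sample data this should reproduce the rank column from the question. Note that in Spark a window with order by but no partition by typically pulls all rows into a single partition, so it is worth checking how this behaves on a large DataFrame.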


solution1
1 ACCEPTED 2016-07-15 15:01:16