
How to assign ranks to records in a spark dataframe based on some conditions?

Given a dataframe:

+-------+-------+
|   A   |   B   |
+-------+-------+
|      a|      1|
|      b|      2|
|      c|      5|
|      d|      7|
|      e|     11|
+-------+-------+

I want to assign ranks to records based on these conditions:

  1. Start the rank at 1.
  2. Give the current record the same rank as the previous record if (B of current record - B of previous record) is <= 2.
  3. Increment the rank if (B of current record - B of previous record) is > 2.

So I want the result to look like this:

+-------+-------+------+
|   A   |   B   | rank |
+-------+-------+------+
|      a|      1|     1|
|      b|      2|     1|
|      c|      5|     2|
|      d|      7|     2|
|      e|     11|     3|
+-------+-------+------+

  • Built-in Spark functions such as rowNumber, rank, and dense_rank do not provide this behavior.
  • I tried keeping a global variable rank and fetching the previous record's values with the lag function, but unlike in SQL, the results are inconsistent because Spark processes the data in a distributed fashion.
  • I also tried passing the lag values of each record to a UDF while generating the new column and applying the conditions inside the UDF. The problem is that I can get lag values for columns A and B, but not for column rank. The following query fails because the column name rank cannot be resolved:

    HiveContext.sql("SELECT df.*,LAG(df.rank, 1) OVER (ORDER BY B , 0) AS rank_lag, udfGetVisitNo(B,rank_lag) as rank FROM df")

  • I cannot get the lag value of a column that I am currently adding.

  • Also, I don't want methods that require df.collect(), as this dataframe is quite large and collecting it on a single node results in memory errors.

Is there any other method by which I can achieve this? I would like a solution with time complexity O(n), n being the number of records.

A SQL solution would be:

    select a, b, 1 + sum(col) over (order by a) as rnk
    from (
        select t.*,
               case when b - lag(b, 1, b) over (order by a) <= 2 then 0 else 1 end as col
        from t
    ) x

The solution assumes the rows are ordered by column a. If the ordering should follow column b instead, use order by b in both window clauses.
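
To see why the query yields the expected ranks, its two steps can be traced outside the database. The following is a minimal pure-Python sketch, not Spark code: the function name gap_ranks and the parameter max_gap are hypothetical names introduced here for illustration, and the input is assumed to be already sorted.

```python
from itertools import accumulate


def gap_ranks(values, max_gap=2):
    """Rank an already-sorted list; the rank goes up when the gap exceeds max_gap."""
    # Step 1 (inner query): flag is 0 when b - previous b <= max_gap, else 1.
    # The first row has no previous row, so, like lag(b, 1, b), it is
    # compared with itself and gets a gap of 0.
    previous = values[:1] + values[:-1]
    flags = [0 if cur - prev <= max_gap else 1
             for cur, prev in zip(values, previous)]
    # Step 2 (outer query): rank = 1 + running sum of the flags.
    return [1 + total for total in accumulate(flags)]


# Sample data from the question, ordered by column A.
rows = [("a", 1), ("b", 2), ("c", 5), ("d", 7), ("e", 11)]
ranks = gap_ranks([b for _, b in rows])
for (a, b), rank in zip(rows, ranks):
    print(a, b, rank)
```

For the sample data the flags are 0, 0, 1, 0, 1, so the ranks come out as 1, 1, 2, 2, 3. In Spark the same two steps correspond to the lag and sum window functions over an ordered window. Bear in mind that a window with an ordering but no partitioning typically makes Spark move all rows into a single partition, so this may still be a bottleneck for a very large dataframe.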

SQL Server example
