How to assign ranks to records in a Spark dataframe based on some conditions?
Given a dataframe:
+---+---+
|  A|  B|
+---+---+
|  a|  1|
|  b|  2|
|  c|  5|
|  d|  7|
|  e| 11|
+---+---+
I want to assign ranks to records based on conditions:
So I want the result to look like this:
+---+---+----+
|  A|  B|rank|
+---+---+----+
|  a|  1|   1|
|  b|  2|   1|
|  c|  5|   2|
|  d|  7|   2|
|  e| 11|   3|
+---+---+----+
One more method I tried was passing the lag values of records to a UDF while generating a new column, and applying the conditions inside the UDF. The problem is that I can get lag values for columns A and B, but not for the column rank. The following query fails because it cannot resolve the column name rank:
HiveContext.sql("SELECT df.*,LAG(df.rank, 1) OVER (ORDER BY B , 0) AS rank_lag, udfGetVisitNo(B,rank_lag) as rank FROM df")
I cannot get the lag value of a column that I am currently adding.
Also, I don't want methods that require df.collect(), as this dataframe is quite large and collecting it on a single node results in memory errors.
Is there any other way to achieve the same result? I would like a solution with time complexity O(n), where n is the number of records.
A SQL solution would be:
select a,b,1+sum(col) over(order by a) as rnk
from
(
select t.*
,case when b - lag(b,1,b) over(order by a) <= 2 then 0 else 1 end as col
from t
) x
The solution assumes the ordering is based on column a.
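To check the query against the sample data without a Spark cluster, here is a minimal sketch that runs essentially the same SQL through Python's built-in sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (needed for window functions). The helper name rank_rows, the in-memory table layout, and the trailing "order by a" (added only to make the output order deterministic) are not part of the original answer. The threshold of 2 is taken from the query itself.

```python
import sqlite3

# Sample rows from the question; the table name `t` follows the answer's SQL.
ROWS = [("a", 1), ("b", 2), ("c", 5), ("d", 7), ("e", 11)]

QUERY = """
select a, b, 1 + sum(col) over (order by a) as rnk
from (
    select t.*,
           -- 1 marks the start of a new group, 0 continues the current one
           case when b - lag(b, 1, b) over (order by a) <= 2 then 0 else 1 end as col
    from t
) x
order by a
"""


def rank_rows(rows):
    """Load (a, b) pairs into an in-memory table and return (a, b, rnk) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table t (a text, b integer)")
        conn.executemany("insert into t values (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in rank_rows(ROWS):
        print(row)
```

The inner query flags each row whose gap to the previous b exceeds 2, and the running sum of those flags turns them into group numbers, so the rank column never has to refer to itself. In Spark the same statement should be usable through the SQL interface once the dataframe is registered as a table, though that is not verified here.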