
How to assign ranks to records in a spark dataframe based on some conditions?

Given a dataframe:

+-------+-------+
|   A   |   B   |
+-------+-------+
|      a|      1|
|      b|      2|
|      c|      5|
|      d|      7|
|      e|     11|
+-------+-------+

I want to assign ranks to the records based on these conditions:

  1. Start the rank at 1.
  2. Assign rank = rank of the previous record if (B of current record - B of previous record) is <= 2.
  3. Increment the rank when (B of current record - B of previous record) is > 2.

So I want the result to look like this:

+-------+-------+------+
|   A   |   B   | rank |
+-------+-------+------+
|      a|      1|     1|
|      b|      2|     1|
|      c|      5|     2|
|      d|      7|     2|
|      e|     11|     3|
+-------+-------+------+
  • Built-in Spark functions like rowNumber, rank, and dense_rank don't provide any functionality to achieve this.
  • I tried doing it with a global rank variable, fetching previous-record values using the lag function, but it does not give consistent results because of Spark's distributed processing, unlike in SQL.
  • Another method I tried was passing lag values of records to a UDF while generating a new column and applying the conditions inside the UDF. The problem is that I can get lag values for columns A and B, but not for the rank column. This gives an error because it cannot resolve the column name rank:

    HiveContext.sql("SELECT df.*, LAG(df.rank, 1) OVER (ORDER BY B , 0) AS rank_lag, udfGetVisitNo(B, rank_lag) AS rank FROM df")

  • I cannot get the lag value of a column that I am currently adding.

  • Also, I don't want methods that require using df.collect(), as this dataframe is quite large and collecting it on a single worker node results in memory errors.

Is there any other method by which I can achieve the same? I would like a solution with time complexity O(n), n being the number of records.
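For reference, the target logic itself is a single sequential pass over the sorted records. A minimal sketch in plain Python (the helper name `assign_ranks` is illustrative, not a Spark API; this ignores the distributed-execution problem and just pins down the rule):

```python
def assign_ranks(rows, gap=2):
    """Assign a rank to each (A, B) row: the rank stays the same while
    consecutive B values differ by at most `gap`, and increments otherwise."""
    ranked = []
    rank = 1
    prev_b = None
    for a, b in rows:  # rows assumed already sorted by B
        if prev_b is not None and b - prev_b > gap:
            rank += 1
        ranked.append((a, b, rank))
        prev_b = b
    return ranked

rows = [("a", 1), ("b", 2), ("c", 5), ("d", 7), ("e", 11)]
print(assign_ranks(rows))
# → [('a', 1, 1), ('b', 2, 1), ('c', 5, 2), ('d', 7, 2), ('e', 11, 3)]
```

This is O(n), but the sequential dependency (each rank depends on the previous one) is exactly what makes a naive global-variable approach unreliable across Spark partitions.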

A SQL solution would be:

select a,b,1+sum(col) over(order by a) as rnk
from 
(
select t.*
,case when b - lag(b,1,b) over(order by a) <= 2 then 0 else 1 end as col
from t
) x

The solution assumes the ordering is based on column a.
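The trick in the SQL above is to break the sequential dependency into two window-friendly steps: a lag comparison producing a 0/1 flag, then a running sum of that flag. The same decomposition can be checked in plain Python (variable names here are illustrative):

```python
rows = [("a", 1), ("b", 2), ("c", 5), ("d", 7), ("e", 11)]  # already ordered by a

# Inner query: col = 0 when b - lag(b) <= 2, else 1.
# lag(b, 1, b) defaults the first row's lag to its own b, so its flag is 0.
bs = [b for _, b in rows]
flags = [0 if b - prev <= 2 else 1 for b, prev in zip(bs, [bs[0]] + bs[:-1])]

# Outer query: rnk = 1 + sum(col) over (order by a), i.e. 1 + running sum.
running, ranks = 0, []
for f in flags:
    running += f
    ranks.append(1 + running)

print(ranks)  # → [1, 1, 2, 2, 3]
```

Because each step is expressible as a window function, Spark can evaluate it without collecting the dataframe to a single node (though a global `ORDER BY` window without `PARTITION BY` still pulls all rows into one partition).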

SQL Server example
