Using a Window Function in SQL/Spark to Apply a Specific Filter

Question

I currently have a large dataset, but for simplicity, it looks like this:

Person, Friend, Friendship_Score, Days_Known
Alice, Bob, 120, 56
Alice, Candy, 20, 23
Bob, Daniel, 24, 77
Bob, Alice, 120, 56
Candy, Alice, 20, 23
Daniel, Bob, 24, 77
Daniel, Ed, 56, 65
Daniel, Fin, 52, 54
Daniel, Gin, 22, 50
...

I want to use a window function on this dataset to make it look like this:

Alice, Bob, 120, 56
Bob, Daniel, 24, 77
Bob, Alice, 120, 56
Candy, Alice, 20, 23
Daniel, Bob, 24, 77
Daniel, Ed, 56, 65
Daniel, Fin, 52, 54

The logic behind the filter should be that, for each person, we rank their friends by how long they have known each other (highest days_known first) and then keep only as many friends as it takes for their combined friendship_score to reach 100.

For example, Alice would only need Bob, because she has known him the longest and their friendship_score is already over 100. Bob would need both Daniel and Alice: Bob has known Daniel longer, but their friendship_score is only 24. After adding Alice, the friend Bob has known next longest, the combined friendship_score is above 100.
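To make the rule concrete, here is a minimal plain-Python sketch of it applied to the sample rows above. The helper name keep_until_threshold and the threshold parameter are hypothetical, chosen only for illustration; this is not Spark code, just the same logic written as a loop.

```python
from collections import defaultdict

# Sample rows from the question: (person, friend, friendship_score, days_known)
ROWS = [
    ("Alice", "Bob", 120, 56),
    ("Alice", "Candy", 20, 23),
    ("Bob", "Daniel", 24, 77),
    ("Bob", "Alice", 120, 56),
    ("Candy", "Alice", 20, 23),
    ("Daniel", "Bob", 24, 77),
    ("Daniel", "Ed", 56, 65),
    ("Daniel", "Fin", 52, 54),
    ("Daniel", "Gin", 22, 50),
]


def keep_until_threshold(rows, threshold=100):
    """Hypothetical helper: for each person, keep the longest-known friends
    until the running friendship_score total reaches `threshold`."""
    by_person = defaultdict(list)
    for row in rows:
        by_person[row[0]].append(row)

    kept = []
    for friends in by_person.values():
        running_total = 0
        # Longest-known friends first (highest days_known at the top).
        for row in sorted(friends, key=lambda r: r[3], reverse=True):
            if running_total >= threshold:
                break  # the friends kept so far already reach the threshold
            kept.append(row)
            running_total += row[2]
    return kept


kept_rows = keep_until_threshold(ROWS)
for row in kept_rows:
    print(row)
```

The check happens before a row is added, so the friend who pushes the total past 100 is still kept. A person whose total never reaches 100, such as Candy, keeps all of their friends.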

I know we need some kind of window function with a rolling sum, but I am having trouble putting the ideas into code and was wondering if anyone could help with this. Thank you!

Answer 1

I don't have much experience with Spark, but the docs indicate it supports both window functions and subqueries in the FROM clause (a select from a select), which you will need in order to filter on the result of the window function.

Notice that a running sum over the frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW already reaches 100 on the last record you want to keep, so filtering on it directly would drop that record. What you really want is a partial sum that does not include the current row, so that you can filter and retain the right records. You can get this by taking the SUM window function and subtracting the current record's score. So your window function should read: SUM(friendship_score) OVER (Partition By person Order By days_known Desc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - friendship_score As prior_total_score

Select person, friend, friendship_score, days_known
From (
      -- prior_total_score = running total of the rows ranked above this one
      Select *,
             SUM(friendship_score) OVER (Partition By person
                                         Order By days_known Desc
                                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               - friendship_score As prior_total_score
      From MyTable
     ) As scored
Where prior_total_score < 100

You can add an Order By to the outer Select as desired.
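As a quick sanity check, the sketch below runs this kind of query against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for Spark (an assumption made purely so the example is self-contained, and it needs an SQLite build with window-function support, 3.25 or later). The frame is spelled ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, the standard SQL form, and the subquery alias scored and the outer Order By are illustrative choices.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable ("
    "person TEXT, friend TEXT, friendship_score INTEGER, days_known INTEGER)"
)
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?, ?)",
    [
        ("Alice", "Bob", 120, 56),
        ("Alice", "Candy", 20, 23),
        ("Bob", "Daniel", 24, 77),
        ("Bob", "Alice", 120, 56),
        ("Candy", "Alice", 20, 23),
        ("Daniel", "Bob", 24, 77),
        ("Daniel", "Ed", 56, 65),
        ("Daniel", "Fin", 52, 54),
        ("Daniel", "Gin", 22, 50),
    ],
)

QUERY = """
SELECT person, friend, friendship_score, days_known
FROM (
    SELECT *,
           -- running total up to this row, minus this row's own score
           SUM(friendship_score) OVER (
               PARTITION BY person
               ORDER BY days_known DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) - friendship_score AS prior_total_score
    FROM MyTable
) AS scored
WHERE prior_total_score < 100
ORDER BY person, days_known DESC
"""

result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

On the sample data this keeps the seven rows listed in the question and drops Alice's Candy row and Daniel's Gin row. I have not run it on Spark itself, so treat it as a check of the SQL logic rather than of Spark's behavior.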

Answer 1
0 2021-05-30 02:04:01
