
Calculating percentiles in AWS Athena

The result of my query is used in AWS QuickSight. Even though QuickSight offers percentileCont(), which does the job, I want to compute the percentiles in the query instead of using a calculated field.

Eventually, what I want to do is create a point column where

under 25th percentile -> 0
under 50th percentile -> 1
under 75th percentile -> 2
rest -> 3

depending on a column whose values range over [a, b].

Right now I find the value at each percentile and manually create the ranges:

With table as (
    SELECT *
         , cast(date_diff('day', last_transaction, current_date) as double) as col
)
SELECT *
     , case 
         when col between 0 and 25 then 0
         when col between 26 and 66 then 1
         when col between 67 and 193 then 2
         when col >= 194 then 3
       end as point
  FROM table;

However, I want to make it dynamic, so that instead of [0, 25] the first range would be something like [min(col), 25percentile(col)].

The above query outputs:

col   point
333     3
166     2
 96     1
 .

With NTILE() added, thanks to @Gordon Linoff:

With table as (
    SELECT *
         , cast(date_diff('day', last_transaction, current_date) as double) as col
)
SELECT *
     , case 
         when col between 0 and 25 then 0
         when col between 26 and 66 then 1
         when col between 67 and 193 then 2
         when col >= 194 then 3
       end as point
     , NTILE(4) over(order by col) as pt
  FROM table;

This outputs:

col   point
0     1
0     1
0     1
 .

It seems to mess up the col calculation.

You are pretty much describing the ntile() function:

SELECT t.*,
       NTILE(4) OVER (ORDER BY col) - 1 as point
FROM table t;

Two caveats:

  • NTILE(<n>) returns values between 1 and n, which is why 1 is subtracted above.
  • NTILE() makes sure the resulting tiles are equal in size (as nearly as the row count allows). That means that rows with the same value at a boundary could end up in different bins.

An alternative that keeps equal values in the same bin (though the bins might have different sizes) is percent_rank(). In your case:

SELECT t.*,
       -- PERCENT_RANK() is 0 for the lowest value, so clamp that row to bin 0
       GREATEST(CEILING(PERCENT_RANK() OVER (ORDER BY col) * 4) - 1, 0) as point
FROM table t;
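
To see the difference between the two functions on tied values, here is a minimal sketch that runs both over a small inline data set. The VALUES list is made up for illustration and is not from the question; it assumes a Presto/Athena engine where `VALUES ... AS t(col)` is accepted.

```sql
-- Hypothetical sample: eight rows, four of which share the value 10.
SELECT col,
       NTILE(4) OVER (ORDER BY col) - 1 AS ntile_point,  -- equal-sized bins
       PERCENT_RANK() OVER (ORDER BY col) AS pct_rank     -- ties share one rank
FROM (VALUES 10, 10, 10, 10, 20, 30, 40, 50) AS t(col)
ORDER BY col;
```

With eight rows, NTILE(4) puts two rows in each tile, so the four rows with col = 10 are split between ntile_point 0 and 1, while all four get the same pct_rank of 0.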

In Presto, I think that approx_percentile() and a case expression can do what you want:

select t.*, 
    case
        when col <= approx_percentile(col, 0.25) over() then 0
        when col <= approx_percentile(col, 0.50) over() then 1
        when col <= approx_percentile(col, 0.75) over() then 2
        else 3
    end as point
from mytable t
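
A variation on the same idea, offered as a sketch rather than a tested solution: compute the three cut points once in a CTE and cross join them back to the rows. It assumes the array form of approx_percentile() is available in your Athena engine version; the CTE names src and cuts and the inline sample values are hypothetical stand-ins for the real table.

```sql
WITH src AS (
    -- Hypothetical inline sample standing in for the real table
    SELECT col
    FROM (VALUES 3, 12, 25, 40, 66, 90, 150, 194, 250, 333) AS v(col)
),
cuts AS (
    -- Single aggregation; p is an array holding the three cut points
    SELECT approx_percentile(col, ARRAY[0.25, 0.50, 0.75]) AS p
    FROM src
)
SELECT s.col,
       CASE
           WHEN s.col <= c.p[1] THEN 0   -- Presto arrays are 1-indexed
           WHEN s.col <= c.p[2] THEN 1
           WHEN s.col <= c.p[3] THEN 2
           ELSE 3
       END AS point
FROM src s
CROSS JOIN cuts c;
```

Keep in mind that approx_percentile() returns an approximation, so the cut points may differ slightly from exact percentiles such as those QuickSight's percentileCont() would give.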
