Efficient way to sum on one column where count(value_1) / count(value_2) of another column is greater than x
I have a table of the following structure:
| id | bool | amt |
-------------------
| 1 | 0 | 4 |
| 1 | 1 | 3 |
| 1 | 1 | 5 |
| 2 | 0 | 8 |
| 2 | 1 | 4 |
| 2 | 0 | 4 |
I want to get the sum of amt, but only when the ratio of bool = 1 / bool = 0 per id is greater than 0.6.
I have successfully done this like this:
SELECT SUM(amt) as total_amt
FROM table
WHERE id IN (
SELECT id
FROM table
GROUP BY id
HAVING CAST(SUM(bool) AS DOUBLE) / CAST(COUNT(bool) AS DOUBLE) > 0.6
)
However, my problem is that this is a toy version of my actual tables and data; in reality the data is very large. When I run this query on all my data, I get errors saying either that the memory limit of the cluster has been reached or that the execution time limit has been reached. If I remove the WHERE clause that finds the ids satisfying the ratio, it runs without errors.
Before resorting to having these limits increased, is there any way I can achieve this more efficiently, in terms of memory, execution time, or both?
You can use two levels of aggregation:
select sum(id_amount)
from (select id, sum(amt) as id_amount,
             avg(case when bool = 1 then 1.0 else 0 end) as ratio
      from t
      group by id
     ) t
where ratio > 0.6;
Note: I don't have much experience with Presto. I think you can use:
avg(bool)
or:
avg(bool::int)
instead of the above expression.
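As a quick sanity check, here is a minimal sketch that runs both the IN-subquery form and the two-level aggregation against the toy rows from the question. It uses Python's built-in sqlite3 module rather than Presto, so it only confirms that the two queries agree on the result. It says nothing about memory or execution time on a real cluster. The table name t, the alias per_id, and the helper run_both are assumptions made for illustration.

```python
import sqlite3

# Toy rows from the question: (id, bool, amt).
ROWS = [(1, 0, 4), (1, 1, 3), (1, 1, 5), (2, 0, 8), (2, 1, 4), (2, 0, 4)]

# Original approach: filter ids with an IN subquery (t is referenced twice).
IN_SUBQUERY = """
SELECT SUM(amt) AS total_amt
FROM t
WHERE id IN (
    SELECT id
    FROM t
    GROUP BY id
    HAVING CAST(SUM(bool) AS REAL) / COUNT(bool) > 0.6
)
"""

# Two-level aggregation: aggregate once per id, then filter on the ratio.
TWO_LEVEL = """
SELECT SUM(id_amt) AS total_amt
FROM (
    SELECT id,
           SUM(amt) AS id_amt,
           AVG(CASE WHEN bool = 1 THEN 1.0 ELSE 0.0 END) AS ratio
    FROM t
    GROUP BY id
) per_id
WHERE ratio > 0.6
"""


def run_both():
    """Load the toy rows into an in-memory SQLite table and run both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, bool INTEGER, amt INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ROWS)
    first = conn.execute(IN_SUBQUERY).fetchone()[0]
    second = conn.execute(TWO_LEVEL).fetchone()[0]
    conn.close()
    return first, second


if __name__ == "__main__":
    print(run_both())
```

On the toy data only id 1 passes the filter (two of its three rows have bool = 1), so both queries return 4 + 3 + 5 = 12.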