
sum() vs. count()

Consider a voting system implemented in PostgreSQL, where each user can vote up or down on a "foo". There is a foo table that stores all the "foo information", and a votes table that stores the user_id, foo_id, and vote, where vote is +1 or -1.

To get the vote tally for each foo, the following query would work:

SELECT sum(vote) FROM votes WHERE foo.foo_id = votes.foo_id;

But the following would work just as well:

(SELECT count(vote) FROM votes 
 WHERE foo.foo_id = votes.foo_id 
 AND votes.vote = 1)
- (SELECT count(vote) FROM votes 
   WHERE foo.foo_id = votes.foo_id 
   AND votes.vote = (-1))

I currently have an index on votes.foo_id.

Which is the more efficient approach? (In other words, which would run faster?) I'm interested in both the PostgreSQL-specific answer and the general SQL answer.

EDIT

A lot of answers have been taking into account the case where vote is null. I forgot to mention that there is a NOT NULL constraint on the vote column.

Also, many have been pointing out that the first is much easier to read. Yes, that is definitely true, and if a colleague wrote the second one, I would be exploding with rage unless there was a performance necessity. Nevertheless, the question is still about the performance of the two. (Technically, if the first query were way slower, it wouldn't be such a crime to write the second query.)

Of course, the first example is faster, simpler and easier to read. That should be obvious even before one gets slapped with aquatic creatures. While sum() is slightly more expensive than count(), what matters much, much more is that the second example needs two scans.

But there is an actual difference, too: sum() can return NULL where count() doesn't. I quote the manual on aggregate functions:

It should be noted that except for count, these functions return a null value when no rows are selected. In particular, sum of no rows returns null, not zero as one might expect,
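The quoted behaviour can be seen without touching real tables. The following is a minimal sketch, assuming a few made-up rows in an inline CTE that stands in for the votes table; the values and the foo_id 2 are illustrative only.

```sql
-- Hypothetical inline data standing in for the votes table.
WITH votes(foo_id, vote) AS (
   VALUES (1, 1), (1, -1), (1, 1)
)
SELECT sum(vote)              AS raw_sum,    -- NULL, because no row has foo_id = 2
       coalesce(sum(vote), 0) AS safe_sum,   -- 0
       count(*)               AS row_count   -- 0
FROM   votes
WHERE  foo_id = 2;
```

So for a foo without any votes, the sum() form yields NULL while the count() difference yields 0, unless the sum is wrapped in coalesce().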

Since you seem to have a soft spot for performance optimization, here's a detail you might like: count(*) is slightly faster than count(vote). The two are only equivalent if vote is NOT NULL. Test performance with EXPLAIN ANALYZE.
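One way to check that claim on your own data is sketched below. It assumes the votes table from the question exists and is populated; the value foo_id = 5 is arbitrary. Any difference is likely to be small, so run each statement several times and compare the reported execution times rather than trusting a single run.

```sql
-- count(*) counts rows; count(vote) additionally has to check vote for NULL.
EXPLAIN ANALYZE SELECT count(*)    FROM votes WHERE foo_id = 5;
EXPLAIN ANALYZE SELECT count(vote) FROM votes WHERE foo_id = 5;
```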

On closer inspection

Both queries are syntactical nonsense, standing alone. They only make sense if you copied them from the SELECT list of a bigger query like:

SELECT *, (SELECT sum(vote) FROM votes WHERE votes.foo_id = foo.foo_id)
FROM   foo;

The important point here is the correlated subquery, which may be fine if you are only reading a small fraction of votes in your query. We would see additional WHERE conditions, and you should have matching indexes.

In Postgres 9.3 or later, the alternative, cleaner, 100 % equivalent solution would be with LEFT JOIN LATERAL ... ON true:

SELECT *
FROM   foo f
LEFT   JOIN LATERAL (
   SELECT sum(vote) FROM votes WHERE foo_id = f.foo_id
   ) v ON true;

Performance is typically similar.

However, while reading large parts or all of table votes, this will be (much) faster:

SELECT f.*, v.score
FROM   foo f
JOIN   (
   SELECT foo_id, sum(vote) AS score
   FROM   votes
   GROUP  BY 1
   ) v USING (foo_id);

Aggregate values in a subquery first, then join to the result.
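One caveat with the aggregate-then-join form: the plain JOIN drops any foo that has no rows in votes, whereas the correlated subquery and the LATERAL version keep it (with a NULL score). A minimal sketch that keeps those rows, assuming a tally of 0 is wanted for a foo without votes (that choice is an assumption, not something the question states):

```sql
SELECT f.*, coalesce(v.score, 0) AS score   -- 0 instead of NULL for foos without votes
FROM   foo f
LEFT   JOIN (
   SELECT foo_id, sum(vote) AS score
   FROM   votes
   GROUP  BY foo_id
   ) v USING (foo_id);
```

Drop the coalesce() if NULL is the preferred marker for "no votes yet".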

The first one will be faster. You can check this in a simple way.

Generate some data:

CREATE TABLE votes (foo_id integer, vote integer);
-- Insert 1,000,000 rows spread over 100 foos (foo_id 1 to 100)
INSERT INTO votes
SELECT floor(random() * 100)::int + 1,               -- uniform foo_id in 1..100
       CASE WHEN random() < 0.5 THEN -1 ELSE 1 END   -- up or down with equal probability
FROM   generate_series(1, 1000000);
CREATE INDEX idx_votes_id ON votes (foo_id);

Check both:

EXPLAIN ANALYZE SELECT SUM(vote) FROM votes WHERE foo_id = 5;
EXPLAIN ANALYZE SELECT (SELECT COUNT(*) FROM votes WHERE foo_id = 5 AND vote = 1) - (SELECT COUNT(*) FROM votes WHERE foo_id = 5 AND vote = -1);

But the truth is that they are not equivalent. To make sure the first one behaves like the second, you need to handle the NULL case:

SELECT COALESCE(SUM(vote), 0) FROM votes WHERE foo_id = 5;

One more thing. If you are using PostgreSQL 9.2 or later, you can create your index with both columns in it, which gives you a chance of getting an index-only scan:

-- Named differently from the single-column index above so that both can coexist
CREATE INDEX idx_votes_id_vote ON votes (foo_id, vote);

BUT! In some situations this index may be worse, so you should try both and run EXPLAIN ANALYZE to see which one is best, or even create both, check which one PostgreSQL uses most, and drop the other.
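A rough way to do that comparison is sketched below, assuming the votes table and indexes from this answer exist; foo_id = 5 is arbitrary. Index-only scans rely on the visibility map, so a freshly loaded table usually needs a VACUUM before they pay off.

```sql
-- Refresh the visibility map and planner statistics.
VACUUM ANALYZE votes;

-- Look for "Index Only Scan" in the plan output.
EXPLAIN ANALYZE SELECT sum(vote) FROM votes WHERE foo_id = 5;

-- After running a realistic workload, see how often each index on votes was used.
SELECT indexrelname, idx_scan
FROM   pg_stat_user_indexes
WHERE  relname = 'votes'
ORDER  BY idx_scan DESC;
```

An index whose idx_scan stays at or near zero under normal load is a candidate for dropping.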

I would expect the first query to run faster, as it is a single query, and it is more readable (handy in case you have to get back to this after some time).

The second query consists of two queries. You only get a result as if it were a single query.

That said, to be absolutely sure which of these works better for you, I would populate both tables with lots of dummy data and check the query execution time.
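A minimal sketch of such a test follows. The column list of foo is not shown in the question, so the name column here is hypothetical, as are the row counts; skip the CREATE TABLE statements if the tables already exist.

```sql
-- Hypothetical schema: only foo_id, user_id and vote are named in the question.
CREATE TABLE foo (foo_id integer PRIMARY KEY, name text);
CREATE TABLE votes (user_id integer NOT NULL, foo_id integer NOT NULL, vote integer NOT NULL);

-- 100 foos and 1,000,000 random votes.
INSERT INTO foo
SELECT g, 'foo ' || g
FROM   generate_series(1, 100) g;

INSERT INTO votes
SELECT g,
       floor(random() * 100)::int + 1,
       CASE WHEN random() < 0.5 THEN -1 ELSE 1 END
FROM   generate_series(1, 1000000) g;

CREATE INDEX ON votes (foo_id);
ANALYZE foo;
ANALYZE votes;

-- Variant 1: a single sum() per foo.
EXPLAIN ANALYZE
SELECT f.foo_id,
       (SELECT sum(vote) FROM votes v WHERE v.foo_id = f.foo_id) AS score
FROM   foo f;

-- Variant 2: two count() subqueries per foo.
EXPLAIN ANALYZE
SELECT f.foo_id,
       (SELECT count(*) FROM votes v WHERE v.foo_id = f.foo_id AND v.vote = 1)
     - (SELECT count(*) FROM votes v WHERE v.foo_id = f.foo_id AND v.vote = -1) AS score
FROM   foo f;
```

Run each EXPLAIN ANALYZE a few times so that caching effects even out before comparing the timings.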

