
Why is COUNT() query from large table much faster than SUM()

I have a data warehouse with the following tables:

The main table has about 8 million records:

CREATE TABLE `main` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`cid` mediumint(8) unsigned DEFAULT NULL, -- This is the customer id
`iid` mediumint(8) unsigned DEFAULT NULL, -- This is the item id
`pid` tinyint(3) unsigned DEFAULT NULL, -- This is the period id
`qty` double DEFAULT NULL,
`sales` double DEFAULT NULL,
`gm` double DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=7978349 DEFAULT CHARSET=latin1

The period table has about 50 records and the following fields:

  • id
  • month
  • year

The customer table has about 23,000 records and the following fields:

  • id
  • number //This field is unique
  • name //This is simply a description field

The following query runs very fast (less than 1 second) and returns a count of about 2,000:

select count(*) 
from mydb.main m 
INNER JOIN mydb.period p ON p.id = m.pid 
INNER JOIN mydb.customer c ON c.id = m.cid 
WHERE p.year = 2013 AND c.number = 'ABC';

But this query, which is the same as the previous one except that it sums instead of counts, is much slower (more than 45 seconds):

select sum(sales)
from mydb.main m 
INNER JOIN mydb.period p ON p.id = m.pid 
INNER JOIN mydb.customer c ON c.id = m.cid 
WHERE p.year = 2013 AND c.number = 'ABC';

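When I explain each query, the ONLY difference I see is that on the 'count()' query the 'Extra' field for the main table says 'Using index', while for the 'sum()' query this field is NULL. The first plan below is for the 'count()' query and the second is for the 'sum()' query.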

| id | select_type | table | type  | possible_keys        | key          | key_len | ref                 | rows | Extra       |
|  1 | SIMPLE      | c     | const | PRIMARY,idx_customer | idx_customer | 11      | const               |    1 | Using index |
|  1 | SIMPLE      | p     | ref   | PRIMARY,idx_period   | idx_period   | 4       | const               |    6 | Using index |
|  1 | SIMPLE      | m     | ref   | idx_pci,idx_pic      | idx_pci      | 6       | mydb.p.id,const     |    7 | Using index |

| id | select_type | table | type  | possible_keys        | key          | key_len | ref                 | rows | Extra       |
|  1 | SIMPLE      | c     | const | PRIMARY,idx_customer | idx_customer | 11      | const               |    1 | Using index |
|  1 | SIMPLE      | p     | ref   | PRIMARY,idx_period   | idx_period   | 4       | const               |    6 | Using index |
|  1 | SIMPLE      | m     | ref   | idx_pci,idx_pic      | idx_pci      | 6       | mydb.p.id,const     |    7 | NULL        |

  • Why is the count() so much faster than sum()? Shouldn't it be using the index for both?
  • What can I do to make the sum() go faster?

Thanks in advance!

All the tables show that it is using Engine InnoDB

Also, as a side note, if I just do a 'SELECT *' query, this runs very quickly (less than 2 seconds). I would expect that the 'SUM()' shouldn't take any longer than that since SELECT * has to retrieve the rows anyways...

This is what I've learned:

  • Since the sales field is not a part of the index, it has to retrieve the records from the hard drive (which can be kind of slow).
  • I'm not too familiar with this, but it looks like I/O performance can be increased by switching to an SSD (solid-state drive). I'll have to research this more.
  • For now, I think I'm going to create another layer of summary in order to get the performance I'm looking for.
  • I redefined my index on the main table to be (pid,cid,iid,sales,gm,qty) and now the sum() queries are running VERY fast!
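The index change described in the last bullet could be sketched roughly as follows. This is only an illustration: the post does not say which of the two secondary indexes was redefined or what it was named afterwards, so reusing the name idx_pci here is an assumption.

```sql
-- Assumption: idx_pci is the index that was redefined; the post does not say.
-- Rebuild it so that it also carries the measure columns, which lets
-- SUM(sales), SUM(gm) and SUM(qty) be answered from the index alone.
ALTER TABLE mydb.main
  DROP INDEX idx_pci,
  ADD INDEX idx_pci (pid, cid, iid, sales, gm, qty);
```

The trade-off is a larger index and somewhat more work on every insert or update to main, which is usually acceptable for a warehouse table that is loaded in batches.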

Thanks everybody!

The index is a separate structure that holds only the key columns of each row.

When you do the count() query the actual data from the database can be ignored and just the index used.

When you do the sum(sales) query, then each row has to be read from disk to get the sales figure, hence much slower.

Additionally, the indexes can be read in bulk and then processed in memory, while the row lookups will be randomly thrashing the drive trying to read rows from across the disk.

Finally, the index itself may have summaries of the counts (to help with the plan generation)

Update

You actually have three indexes on your table:

PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH

So you only have indexes on the columns id, pid, cid, iid. (As an aside, most databases are smart enough to combine indexes, so you could probably optimize your indexes somewhat.)

If you added another key like KEY idx_sales(id,sales), that could improve performance, but given the likely numerical distribution of sales values, you would be adding extra performance cost for updates, which is likely a bad thing.

The simple answer is that count() is only counting rows. This can be satisfied by the index.

The sum() needs to identify each row and then fetch the page in order to get the sales column. This adds a lot of overhead -- about one page load per row.

If you add sales into the index, then it should also go very fast, because it will not have to fetch the original data.
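One way to check whether an index change like this has taken effect is to re-run EXPLAIN on the sum() query from the question. If the index chosen for m contains every column the query needs from that table (pid, cid and sales), the Extra column for m should read 'Using index' rather than NULL, just as it does for the count() query.

```sql
-- Re-check the plan for the slow query after changing the index.
-- Look at the row for table m: 'Using index' in Extra means the
-- query is served from the index without fetching the table rows.
EXPLAIN
SELECT SUM(m.sales)
FROM mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
```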
