
Optimizing SELECT count(DISTINCT ip)

I'm trying to get aggregated results (total unique IPs) from a table with about 2M new rows every day.

The table:

CREATE TABLE `clicks` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `hash` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `type` enum('popunder','gallery','exit','direct') COLLATE utf8_unicode_ci NOT NULL,
  `impression_time` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `source_user_id` int(11) NOT NULL,
  `destination_user_id` int(11) NOT NULL,
  `destination_campaign_id` int(11) NOT NULL,
  `destination_campaign_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `destination_campaign_url` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `ip` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `referrer` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `country_code` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `country_id` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `country` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `isp` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `category_id` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `category` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `bid` float(8,2) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`),
  KEY `ip` (`ip`),
  KEY `source_user_id` (`source_user_id`),
  KEY `destination_user_id` (`destination_user_id`),
  KEY `destination_campaign_id` (`destination_campaign_id`),
  KEY `clicks_hash_index` (`hash`),
  KEY `clicks_created_at_index` (`created_at`),
  KEY `campaign_date` (`destination_campaign_id`,`created_at`),
  KEY `source_user_date` (`source_user_id`,`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=301539660 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

My query:

SELECT SUM(ips_by_date.count) as count, ips_by_date.date as date
FROM (SELECT count(DISTINCT ip) as count, DATE(created_at) as date 
      FROM clicks as clicks 
      WHERE created_at BETWEEN '2016-05-22 00:00:00' AND '2016-05-23 23:59:59' 
      GROUP BY DATE(created_at)) as ips_by_date 
GROUP BY date;

Now, this query took 93 seconds to run for just one day and I feel like I'm missing something.

Is there any optimization I can make to speed up the performance of this simple count?

Thank you.

First, I don't see why a subquery is necessary. The inner query has one row per date, so there is no need to aggregate again. Second, your query is actually for two days, not one, but I get the point about performance.

So, let's start with:

SELECT count(DISTINCT ip) as count, DATE(created_at) as date 
FROM clicks  
WHERE created_at BETWEEN '2016-05-22 00:00:00' AND '2016-05-23 23:59:59' 
GROUP BY DATE(created_at);

For this query, you want an index on clicks(created_at, ip). Note also that I would write this as:

SELECT count(DISTINCT ip) as count, DATE(created_at) as date 
FROM clicks  
WHERE created_at >= '2016-05-22' AND created_at < '2016-05-24' 
GROUP BY DATE(created_at);
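
For reference, the suggested index could be added like this (the index name is just an example):

ALTER TABLE clicks ADD INDEX created_at_ip (created_at, ip);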

This should show some improvement, but I don't think it will be radically better, because a filesort is still necessary for the aggregation.

The performance here boils down to the efficiency of your indexes, since there is not much room for changes in your code (see Gordon's answer for a cleaner version of your query).

An index on (created_at) or (created_at, ip) will unfortunately not directly give you distinct ip values without further sorting (since you don't group by created_at), but the latter at least would not require direct table access. So the next optimization would be an index on (date(created_at), ip), even though that means some duplication of data.

As of MySQL 5.7.6, you can use a generated column to create a column dt as date(created_at); before 5.7.6, just create a column dt and update it manually (and if you ever change a created_at value, you have to add a trigger to update that column accordingly). Your initial update might take a while, so update in batches or consider just using it for future queries.
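
A minimal sketch of both variants; the column name dt comes from the text above, while the index name and batch boundaries are illustrative:

-- MySQL 5.7.6+: stored generated column plus the composite index
ALTER TABLE clicks
  ADD COLUMN dt DATE GENERATED ALWAYS AS (DATE(created_at)) STORED,
  ADD INDEX dt_ip (dt, ip);

-- Before 5.7.6: plain column, backfilled in batches (e.g. by id range)
ALTER TABLE clicks ADD COLUMN dt DATE NULL;
UPDATE clicks SET dt = DATE(created_at) WHERE id BETWEEN 1 AND 1000000;
-- ...repeat for the remaining id ranges, then:
ALTER TABLE clicks ADD INDEX dt_ip (dt, ip);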

Adding an index (dt, ip) should now give you the result with a single index range scan, without a filesort and without the need to calculate date() from the datetime:

select count(distinct ip) as count, dt 
from clicks  
where dt >= '2016-05-22' and dt < '2016-05-24' 
group by dt;

If everything works fine, this should take just a few seconds even for some millions of rows.

Some things that might still cause you trouble: since 90 seconds is still a relatively long time for 2 million rows, it might indicate issues with buffer sizes, RAM, or disk. If it takes, e.g., 80 seconds just to load your index into memory, there is not much an index can do after that. An easy test: run your query twice. If the second run is significantly faster (like << 1/10th of the time), you might have to think about tweaking your system settings, architecture, or partitioning.

Having said that, you should not tweak your system (and sometimes not even add another index or a date column) for a query like this, possibly slowing down other, more important things. To get daily statistics, you could just as easily run a task at midnight that computes all the statistics you can think of and saves the results for you to look at in the morning; it would not matter if that query takes hours to run.
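
As an illustration, such a nightly job could be a summary table fed by a scheduled event; the table and event names here are hypothetical, and the event scheduler must be enabled (SET GLOBAL event_scheduler = ON):

CREATE TABLE daily_ip_stats (
  dt DATE NOT NULL PRIMARY KEY,
  unique_ips INT UNSIGNED NOT NULL
);

CREATE EVENT ev_daily_ip_stats
ON SCHEDULE EVERY 1 DAY STARTS '2016-05-23 00:05:00'
DO
  -- recompute yesterday's count; REPLACE makes reruns idempotent
  REPLACE INTO daily_ip_stats (dt, unique_ips)
  SELECT DATE(created_at), COUNT(DISTINCT ip)
  FROM clicks
  WHERE created_at >= CURRENT_DATE - INTERVAL 1 DAY
    AND created_at < CURRENT_DATE
  GROUP BY DATE(created_at);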

First add the composite index already mentioned. Then the real performance problem will be reading a zillion rows to compute a COUNT(DISTINCT...). That action requires either collecting all the values, sorting and doing a GROUP BY, or trying to keep all the distinct values in RAM.

Summary tables are wonderful for speeding up SUM, COUNT, and even AVG in Data Warehousing applications. But COUNT(DISTINCT...) (aka "count unique users") does not lend itself to Summary tables. If you are willing to accept a small error, there is a way. See my blog.

You may not realize it, but the blanket use of 255 in VARCHAR sometimes causes unnecessary performance problems. In this case, ip takes 765 bytes in any tmp table, perhaps in the query in question. Changing it to VARCHAR(39) CHARACTER SET ascii would cut that back by a factor of 20! (It is hard to predict how much, if any, that will speed up your query.) You could get it down to BINARY(16) with a simple stored function.
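
A sketch of that column change (note that NOT NULL has to be restated in MODIFY; also, MySQL 5.6+ ships built-in INET6_ATON()/INET6_NTOA(), which could feed a 16-byte binary column instead of a hand-written stored function):

ALTER TABLE clicks
  MODIFY ip VARCHAR(39) CHARACTER SET ascii NOT NULL;

-- alternative: 16-byte binary form via the built-in conversion function
-- ALTER TABLE clicks ADD COLUMN ip_bin VARBINARY(16);
-- UPDATE clicks SET ip_bin = INET6_ATON(ip);  -- backfill, ideally in batches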
