MySQL, DISTINCT in SUM operation

I am currently trying to calculate the number of unique user visits in my application, broken down by user gender. Here is an example query that counts all visits (not unique ones):

SELECT
    DATE(v.visited_at) AS visit_date,
    SUM(IF(u.gender = 'M', 1, 0)) AS male_visit,
    SUM(IF(u.gender = 'F', 1, 0)) AS female_visit,
    SUM(IF(u.gender = '' OR u.gender IS NULL, 1, 0)) AS unknown_visit
FROM 
    visits v
    INNER JOIN users u ON v.user_id = u.id
WHERE
    DATE(v.visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY)
    AND v.duration > 30
GROUP BY
    DATE(v.visited_at)

I tried using subqueries with COUNT(DISTINCT). It works, but it is 4 times slower.

SELECT
    DATE(visited_at) as visit_date,
    (SELECT COUNT(DISTINCT u.id) FROM visits v JOIN users u ON v.user_id = u.id WHERE u.gender = 'M' AND DATE(v.visited_at) = visit_date AND v.duration > 30) AS male_visit,
    (SELECT COUNT(DISTINCT u.id) FROM visits v JOIN users u ON v.user_id = u.id WHERE u.gender = 'F' AND DATE(v.visited_at) = visit_date AND v.duration > 30) AS female_visit,
    (SELECT COUNT(DISTINCT u.id) FROM visits v JOIN users u ON v.user_id = u.id WHERE (u.gender = '' OR u.gender IS NULL) AND DATE(v.visited_at) = visit_date AND v.duration > 30) AS unknown_visit
FROM 
    visits v
WHERE
    DATE(visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY)
GROUP BY
    DATE(visited_at)

Any suggestions?

COUNT(DISTINCT) is always going to be slower than COUNT(). You can try:

SELECT DATE(v.visited_at) AS visit_date,
       COUNT(DISTINCT CASE WHEN u.gender = 'M' THEN u.id END) AS male_visit,
       COUNT(DISTINCT CASE WHEN u.gender = 'F' THEN u.id END) AS female_visit,
       COUNT(DISTINCT CASE WHEN u.gender = '' OR u.gender IS NULL THEN u.id END) AS unknown_visit
FROM visits v INNER JOIN
     users u
     ON v.user_id = u.id
WHERE DATE(v.visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY) AND
      v.duration > 30
GROUP BY DATE(v.visited_at);

I don't know if it will be much faster, though.

There are 2 tables used by the query (users and visits), with sample data.

[Figure: users table]

[Figure: visits table]

Query

SELECT
    DATE(v.visited_date) AS visit_date,
    u.gender,
    COUNT(DISTINCT v.user_id) AS total_count
FROM
    visits v
    INNER JOIN users u ON v.user_id = u.id
WHERE
    DATE(v.visited_date) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY)
    AND v.duration >= 30
GROUP BY u.gender, DATE(v.visited_date)
ORDER BY DATE(v.visited_date) ASC;

[Figure: query result]

This query will give you the count of unique users per gender for each date.

This type of query is likely to be slow, especially if the table has a large number of rows. Because the filter applies DATE() to the column, MySQL generally cannot use an ordinary index on that column and ends up performing a full table scan.
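One way to avoid the scan is to compare the bare column against a computed cutoff so that an index range scan becomes possible. The following is a minimal sketch, assuming the `visits` and `users` columns from the question; the index name `idx_visits_visited_at` is hypothetical. Note that the cutoff here is midnight 30 days ago, which can differ by up to a day from the original `DATE(...) >= DATE_SUB(SYSDATE(), ...)` expression, so adjust the interval if the exact boundary matters.

```sql
-- Hypothetical index; skip this if visited_at is already indexed.
CREATE INDEX idx_visits_visited_at ON visits (visited_at);

SELECT DATE(v.visited_at) AS visit_date,
       COUNT(DISTINCT CASE WHEN u.gender = 'M' THEN u.id END) AS male_visit,
       COUNT(DISTINCT CASE WHEN u.gender = 'F' THEN u.id END) AS female_visit,
       COUNT(DISTINCT CASE WHEN u.gender = '' OR u.gender IS NULL THEN u.id END) AS unknown_visit
FROM visits v
INNER JOIN users u ON v.user_id = u.id
-- No function wrapped around the column, so the optimizer may use the index.
WHERE v.visited_at >= CURDATE() - INTERVAL 30 DAY
  AND v.duration > 30
GROUP BY DATE(v.visited_at);
```

Running `EXPLAIN` on both versions should show whether the optimizer actually picks the index for your data.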

Optimising your database structure is likely to give you far greater performance gains than anything you will get from rewriting the query.

One suggestion is to partition the table by date range. Doing so can greatly reduce query execution time, because instead of scanning the full table MySQL can simply ignore any partitions outside the query's date range. The bigger the table, the more benefit you will see; I would expect it to be anywhere from 2x to 10x faster.
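As a rough illustration of date-range partitioning, here is a sketch under assumptions: the table name `visits_partitioned`, the column types, and the monthly boundaries are all hypothetical and not taken from the question. MySQL requires the partitioning column to be part of every unique key, which is why `visited_at` is added to the primary key.

```sql
-- Hypothetical table definition; column types are assumed.
CREATE TABLE visits_partitioned (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    BIGINT UNSIGNED NOT NULL,
    visited_at DATETIME NOT NULL,
    duration   INT NOT NULL,
    PRIMARY KEY (id, visited_at)
)
PARTITION BY RANGE (TO_DAYS(visited_at)) (
    PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- The partitions column of EXPLAIN shows which partitions are read.
EXPLAIN
SELECT COUNT(*)
FROM visits_partitioned
WHERE visited_at >= '2024-02-10';
```

For pruning to apply, the WHERE clause should filter on `visited_at` directly rather than on `DATE(visited_at)`.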

If you were to replace your gender column with three columns (male, female and unknown), you could replace the three queries containing the slow COUNT(DISTINCT ...) expressions with a single query that has fewer conditions. You can also add the user id to the GROUP BY clause to remove the need to count distinct, since you can group by more than one column.
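The GROUP BY part of this suggestion can be sketched without changing the schema. The version below is an assumption about what was meant, not the answerer's exact query: an inner query collapses the visits to one row per date and user, and the outer query then counts those rows per gender with plain SUM, using the table and column names from the question.

```sql
SELECT d.visit_date,
       SUM(IF(u.gender = 'M', 1, 0)) AS male_visit,
       SUM(IF(u.gender = 'F', 1, 0)) AS female_visit,
       SUM(IF(u.gender = '' OR u.gender IS NULL, 1, 0)) AS unknown_visit
FROM (
    -- One row per (date, user), so each user is counted once per day.
    SELECT DATE(v.visited_at) AS visit_date,
           v.user_id
    FROM visits v
    WHERE DATE(v.visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY)
      AND v.duration > 30
    GROUP BY DATE(v.visited_at), v.user_id
) AS d
INNER JOIN users u ON u.id = d.user_id
GROUP BY d.visit_date;
```

Whether this beats COUNT(DISTINCT CASE ...) depends on the data and indexes, so it is worth timing both.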

Finally, you could add a database trigger. Either add an extra column that the trigger sets to 1 when a visit is logged, if the duration is over 30 and it is the user's first visit of the day, or create a new calendar table for visits and have the trigger increment a counter in it each time a logged visit qualifies as a unique visit for that day.
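A variation on the calendar-table idea is sketched below. Instead of incrementing a counter, it records one row per date and user and lets the primary key reject repeats, which sidesteps having to detect the first visit of the day inside the trigger. The table `daily_unique_visits`, the trigger name, and the column types are hypothetical; it also assumes `duration` is known at insert time and is not updated later.

```sql
-- Hypothetical summary table: at most one row per user per day.
CREATE TABLE daily_unique_visits (
    visit_date DATE NOT NULL,
    user_id    BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (visit_date, user_id)
);

-- Single-statement trigger body, so no DELIMITER change is needed.
CREATE TRIGGER trg_visits_after_insert
AFTER INSERT ON visits
FOR EACH ROW
    INSERT IGNORE INTO daily_unique_visits (visit_date, user_id)
    SELECT DATE(NEW.visited_at), NEW.user_id
    FROM DUAL
    WHERE NEW.duration > 30;

-- Reporting query: COUNT(*) is enough because rows are already unique.
SELECT d.visit_date,
       SUM(IF(u.gender = 'M', 1, 0)) AS male_visit,
       SUM(IF(u.gender = 'F', 1, 0)) AS female_visit,
       SUM(IF(u.gender = '' OR u.gender IS NULL, 1, 0)) AS unknown_visit
FROM daily_unique_visits d
INNER JOIN users u ON u.id = d.user_id
WHERE d.visit_date >= CURDATE() - INTERVAL 30 DAY
GROUP BY d.visit_date;
```

The trade-off is extra work on every insert into `visits` in exchange for a much smaller table to read at report time.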


 