
MySQL InnoDB - GROUP BY on many items

I've got a table with the columns id, date, ad_id, ad_network, and ad_event. The database holds millions of distinct ad_id values, each with a few events associated with it. When I use GROUP BY on ad_id to count the events, the query takes so long that the request fails with a 503 error.

I need to count distinct AdClickThru and AdImpression so that I can calculate the CTR. The problem is that one user can click many times, so I must count only one AdClickThru.

The query is below:

SELECT
    `ad_network`,
    `ad_id`,
    SUM(DISTINCT CASE WHEN `ad_event` = "AdImpression" THEN 1 ELSE 0 END) as AdImpression,
    SUM(DISTINCT CASE WHEN `ad_event` = "AdClickThru" THEN 1 ELSE 0 END) as AdClickThru
FROM `ads`
WHERE 1
    AND `ad_event` IN ("AdImpression", "AdClickThru")
    AND SUBSTR(`date`, 1, 7) = "2020-08"
GROUP BY `ad_id`

I have indexes on ad_id and on (ad_event, date), but they do not help much.

How can I optimize this query? The database will grow to billions of entries and more.

@edit

I forgot to mention that the code above is the inner part of an outer query:

    SELECT
        `ad_network`,
        SUM(`AdImpression`) as cnt_AdImpression,
        SUM(`AdClickThru`) as cnt_AdClickThru,
        100 * SUM(`AdClickThru`) / SUM(`AdImpression`) as ctr
    FROM (
        SELECT
            `ad_network`,
            `ad_id`,
            SUM(DISTINCT CASE WHEN `ad_event` = "AdImpression" THEN 1 ELSE 0 END) as AdImpression,
            SUM(DISTINCT CASE WHEN `ad_event` = "AdClickThru" THEN 1 ELSE 0 END) as AdClickThru
        FROM `ads`
        WHERE 1
            AND `ad_event` IN ("AdImpression", "AdClickThru")
            AND SUBSTR(`date`, 1, 7) = "2020-08" -- better performance
        GROUP BY `ad_id`
    ) a
    GROUP BY `ad_network`
    ORDER BY ctr DESC

The problem is that one user can click many times, so I must count only one AdClickThru.

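Then use MAX(), not SUM(DISTINCT ...). SUM(DISTINCT ...) over the values 0 and 1 can only yield 0 or 1, which is exactly what MAX() returns, so this gives the same result as your expression and is much more efficient. I would also recommend rewriting the date filter so it is index-friendly: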

SELECT
    `ad_network`,
    `ad_id`,
    MAX(`ad_event` = 'AdImpression') as AdImpression,
    MAX(`ad_event` = 'AdClickThru') as AdClickThru
FROM `ads`
WHERE 1
    AND `ad_event` IN ('AdImpression', 'AdClickThru')
    AND `date` >= '2020-08-01'
    AND `date` <  '2020-09-01'
GROUP BY `ad_id`

Notes:

  • the presence of ad_network in the select clause bothers me: if there are several values per ad_id, it is undefined which one will be picked. Either put this column in the group by clause as well, or use an aggregate function in the select clause (such as MAX(ad_network)); if you are ok with an arbitrary value, then be explicit about it with any_value()

  • use single quotes for literal strings rather than double quotes (this is the SQL standard)
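
For reference, here is a sketch of how this inner query might slot into the outer query from the question. It assumes that each ad_id belongs to exactly one ad_network, so adding ad_network to the group by clause does not change the groups. It also assumes that date is a DATE/DATETIME column or an ISO-formatted string, and it has not been run against the real table.

```sql
SELECT
    ad_network,
    SUM(AdImpression) AS cnt_AdImpression,
    SUM(AdClickThru)  AS cnt_AdClickThru,
    100 * SUM(AdClickThru) / SUM(AdImpression) AS ctr
FROM (
    -- One row per ad: each flag is 1 if the ad had at least one such event.
    SELECT
        ad_network,
        ad_id,
        MAX(ad_event = 'AdImpression') AS AdImpression,
        MAX(ad_event = 'AdClickThru')  AS AdClickThru
    FROM ads
    WHERE ad_event IN ('AdImpression', 'AdClickThru')
      AND `date` >= '2020-08-01'
      AND `date` <  '2020-09-01'
    -- Assumption: an ad_id never spans several networks.
    GROUP BY ad_network, ad_id
) a
GROUP BY ad_network
ORDER BY ctr DESC;
```

If an ad_id can appear under several networks, this version counts it once per network, which may or may not be what you want.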

There is no need for two separate aggregations in the main query and the subquery.
You want to count the distinct ad_id values for each of the two cases:

SELECT ad_network,
       COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS cnt_AdImpression,
       COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) AS cnt_AdClickThru,
       100 * 
       COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) / 
       COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS ctr
FROM ads
WHERE ad_event IN ('AdImpression', 'AdClickThru') AND SUBSTR(date, 1, 7) = '2020-08'
GROUP BY ad_network
ORDER BY ctr DESC

The problem here is that you have to repeat the expressions for cnt_AdImpression and cnt_AdClickThru.
You can calculate these expressions in a subquery:

SELECT ad_network, cnt_AdImpression, cnt_AdClickThru,
       100 * cnt_AdClickThru / cnt_AdImpression AS ctr
FROM (
  SELECT ad_network,
         COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS cnt_AdImpression,
         COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) AS cnt_AdClickThru
  FROM ads
  WHERE ad_event IN ('AdImpression', 'AdClickThru') AND SUBSTR(date, 1, 7) = '2020-08'
  GROUP BY ad_network
) t
ORDER BY ctr DESC
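
The single-aggregation query can also use the index-friendly date range instead of SUBSTR(). The sketch below pairs it with a covering index; the index name is hypothetical and the column order is only a guess that should be checked against the real data. It assumes date is a DATE/DATETIME column or an ISO-formatted string.

```sql
-- Hypothetical index: lets the query be answered from the index alone.
-- The column order is an assumption; verify it with EXPLAIN.
CREATE INDEX idx_ads_event_date_network_ad
    ON ads (ad_event, `date`, ad_network, ad_id);

SELECT ad_network, cnt_AdImpression, cnt_AdClickThru,
       100 * cnt_AdClickThru / cnt_AdImpression AS ctr
FROM (
  SELECT ad_network,
         -- CASE without ELSE yields NULL, which COUNT(DISTINCT) ignores.
         COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS cnt_AdImpression,
         COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) AS cnt_AdClickThru
  FROM ads
  WHERE ad_event IN ('AdImpression', 'AdClickThru')
    AND `date` >= '2020-08-01'
    AND `date` <  '2020-09-01'
  GROUP BY ad_network
) t
ORDER BY ctr DESC;
```

Prefixing the SELECT with EXPLAIN shows whether MySQL actually picks the index. On a table with billions of rows, building the index takes time and disk space, so it is worth trying on a copy first.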
