I have a table with the columns id, date, ad_id, ad_network, and ad_event. My database has millions of distinct ad_id values, each with a few events associated with it. When I use GROUP BY on ad_id to count each event, the query takes so long that the request fails with a 503 error.
I need to count distinct AdClickThru and AdImpression events so that I can calculate the CTR. The problem is that one user can click many times, so I must count only one AdClickThru.
The query is below:
SELECT
`ad_network`,
`ad_id`,
SUM(DISTINCT CASE WHEN `ad_event` = "AdImpression" THEN 1 ELSE 0 END) as AdImpression,
SUM(DISTINCT CASE WHEN `ad_event` = "AdClickThru" THEN 1 ELSE 0 END) as AdClickThru
FROM `ads`
WHERE 1
AND `ad_event` IN ("AdImpression", "AdClickThru")
AND SUBSTR(`date`, 1, 7) = "2020-08"
GROUP BY `ad_id`
I have indexes on ad_id and on ad_event + date, but they do not help much.
How can I optimize this query? The database will grow to billions of entries and more.
@edit
I forgot to mention that the code above is the inner part of an outer query:
SELECT
`ad_network`,
SUM(`AdImpression`) as cnt_AdImpression,
SUM(`AdClickThru`) as cnt_AdClickThru,
100 * SUM(`AdClickThru`) / SUM(`AdImpression`) as ctr
FROM (
SELECT
`ad_network`,
`ad_id`,
SUM(DISTINCT CASE WHEN `ad_event` = "AdImpression" THEN 1 ELSE 0 END) as AdImpression,
SUM(DISTINCT CASE WHEN `ad_event` = "AdClickThru" THEN 1 ELSE 0 END) as AdClickThru
FROM `ads`
WHERE 1
AND `ad_event` IN ("AdImpression", "AdClickThru")
AND SUBSTR(`date`, 1, 7) = "2020-08" -- better performance
GROUP BY `ad_id`
) a
GROUP BY `ad_network`
ORDER BY ctr DESC
The problem is that one user can click many times, so I must count only one AdClickThru.
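Then use MAX(), not SUM(DISTINCT ...). In MySQL a comparison evaluates to 1 or 0, so MAX(ad_event = 'AdClickThru') gives the same 0/1 flag per ad_id as your expression, and it avoids the extra duplicate-elimination step, which should make it cheaper. I would also recommend rewriting the date filter as a range so that it is index-friendly: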
SELECT
`ad_network`,
`ad_id`,
MAX(`ad_event` = 'AdImpression') as AdImpression,
MAX(`ad_event` = 'AdClickThru') as AdClickThru
FROM `ads`
WHERE 1
AND `ad_event` IN ('AdImpression', 'AdClickThru')
AND `date` >= '2020-08-01'
AND `date` < '2020-09-01'
GROUP BY `ad_id`
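To check that MAX() really returns the same per-ad_id flags as the SUM(DISTINCT CASE ...) expression, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for MySQL, and the sample rows are made up for illustration; it says nothing about performance on the real table.

```python
import sqlite3

# Hypothetical sample rows (date, ad_id, ad_network, ad_event).
# SQLite is only a stand-in for MySQL in this sketch.
rows = [
    ("2020-08-03", 1, "net_a", "AdImpression"),
    ("2020-08-03", 1, "net_a", "AdClickThru"),
    ("2020-08-04", 1, "net_a", "AdClickThru"),  # repeated click on the same ad_id
    ("2020-08-05", 2, "net_a", "AdImpression"),
    ("2020-08-06", 3, "net_b", "AdImpression"),
    ("2020-08-07", 3, "net_b", "AdImpression"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ads (id INTEGER PRIMARY KEY, date TEXT, ad_id INTEGER, "
    "ad_network TEXT, ad_event TEXT)"
)
conn.executemany(
    "INSERT INTO ads (date, ad_id, ad_network, ad_event) VALUES (?, ?, ?, ?)", rows
)

# The expression from the question: each flag is 0 or 1 per ad_id.
sum_distinct_sql = """
SELECT ad_id,
       SUM(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN 1 ELSE 0 END),
       SUM(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN 1 ELSE 0 END)
FROM ads
WHERE ad_event IN ('AdImpression', 'AdClickThru')
GROUP BY ad_id
ORDER BY ad_id
"""

# The MAX() form: a comparison evaluates to 1 or 0, and MAX keeps the 1 if any row matched.
max_sql = """
SELECT ad_id,
       MAX(ad_event = 'AdImpression'),
       MAX(ad_event = 'AdClickThru')
FROM ads
WHERE ad_event IN ('AdImpression', 'AdClickThru')
GROUP BY ad_id
ORDER BY ad_id
"""

sum_distinct_flags = conn.execute(sum_distinct_sql).fetchall()
max_flags = conn.execute(max_sql).fetchall()
print(sum_distinct_flags == max_flags)
```

On this sample both queries return one row per ad_id with identical flags, including for ad_id 1, which was clicked twice.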
Notes:
The presence of ad_network in the select clause bothers me: if there are several values per ad_id, it is undefined which one will be picked. Either put this column in the group by clause as well, or use an aggregate function in the select clause (such as MAX(ad_network)); if you are OK with an arbitrary value, be explicit about it with any_value().
Use single quotes for literal strings rather than double quotes (this is the SQL standard).
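The reason the range form of the date filter is index-friendly can be seen in a query plan. The sketch below is an illustration under assumptions: it uses SQLite's EXPLAIN QUERY PLAN rather than MySQL's EXPLAIN, and idx_ads_date is a hypothetical single-column index created only for this demonstration. The idea carries over: wrapping the column in a function hides it from an ordinary index, while a plain range comparison does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ads (id INTEGER PRIMARY KEY, date TEXT, ad_id INTEGER, "
    "ad_network TEXT, ad_event TEXT)"
)
# Hypothetical index, created only for this demonstration.
conn.execute("CREATE INDEX idx_ads_date ON ads (date)")


def plan(sql):
    """Return the plan detail strings SQLite reports for a query."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Range comparison on the bare column: the planner can search the index.
range_plan = plan(
    "SELECT ad_id FROM ads WHERE date >= '2020-08-01' AND date < '2020-09-01'"
)
# Function applied to the column: the index on date cannot be used for the filter.
substr_plan = plan(
    "SELECT ad_id FROM ads WHERE SUBSTR(date, 1, 7) = '2020-08'"
)

print("range :", range_plan)
print("substr:", substr_plan)
```

The exact plan wording depends on the SQLite version, so treat the printed text as indicative. On the real MySQL table, running EXPLAIN on both variants is the way to confirm the difference.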
There is no need for 2 separate aggregations in the main query and the subquery.
You want to count the distinct ad_ids for each of the 2 cases:
SELECT ad_network,
COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS cnt_AdImpression,
COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) AS cnt_AdClickThru,
100 *
COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) /
COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS ctr
FROM ads
WHERE ad_event IN ('AdImpression', 'AdClickThru') AND SUBSTR(date, 1, 7) = '2020-08'
GROUP BY ad_network
ORDER BY ctr DESC
The problem here is that you have to repeat the expressions for cnt_AdImpression and cnt_AdClickThru.
You can calculate these expressions in a subquery:
SELECT ad_network, cnt_AdImpression, cnt_AdClickThru,
100 * cnt_AdClickThru / cnt_AdImpression AS ctr
FROM (
SELECT ad_network,
COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END) AS cnt_AdImpression,
COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END) AS cnt_AdClickThru
FROM ads
WHERE ad_event IN ('AdImpression', 'AdClickThru') AND SUBSTR(date, 1, 7) = '2020-08'
GROUP BY ad_network
) t
ORDER BY ctr DESC
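As a sanity check that the single-level COUNT(DISTINCT CASE ...) query produces the same counts as the nested query from the question, here is a small sketch. It again assumes SQLite via Python's sqlite3 as a stand-in for MySQL, uses made-up rows, and assumes each ad_id belongs to a single ad_network; the inner query groups by ad_network and ad_id so that its result is deterministic.

```python
import sqlite3

# Hypothetical sample rows (date, ad_id, ad_network, ad_event).
rows = [
    ("2020-08-03", 1, "net_a", "AdImpression"),
    ("2020-08-03", 1, "net_a", "AdClickThru"),
    ("2020-08-04", 1, "net_a", "AdClickThru"),  # repeated click, must count once
    ("2020-08-05", 2, "net_a", "AdImpression"),
    ("2020-08-06", 3, "net_b", "AdImpression"),
    ("2020-08-07", 3, "net_b", "AdImpression"),
    ("2020-07-30", 4, "net_b", "AdClickThru"),  # outside 2020-08, filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ads (id INTEGER PRIMARY KEY, date TEXT, ad_id INTEGER, "
    "ad_network TEXT, ad_event TEXT)"
)
conn.executemany(
    "INSERT INTO ads (date, ad_id, ad_network, ad_event) VALUES (?, ?, ?, ?)", rows
)

# Two-level form: flag each ad_id first, then sum the flags per network.
nested_sql = """
SELECT ad_network, SUM(AdImpression), SUM(AdClickThru)
FROM (
    SELECT ad_network, ad_id,
           SUM(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN 1 ELSE 0 END) AS AdImpression,
           SUM(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN 1 ELSE 0 END) AS AdClickThru
    FROM ads
    WHERE ad_event IN ('AdImpression', 'AdClickThru') AND SUBSTR(date, 1, 7) = '2020-08'
    GROUP BY ad_network, ad_id
) a
GROUP BY ad_network
ORDER BY ad_network
"""

# Single-level form: count the distinct ad_ids that have each event.
single_pass_sql = """
SELECT ad_network,
       COUNT(DISTINCT CASE WHEN ad_event = 'AdImpression' THEN ad_id END),
       COUNT(DISTINCT CASE WHEN ad_event = 'AdClickThru' THEN ad_id END)
FROM ads
WHERE ad_event IN ('AdImpression', 'AdClickThru') AND SUBSTR(date, 1, 7) = '2020-08'
GROUP BY ad_network
ORDER BY ad_network
"""

nested = conn.execute(nested_sql).fetchall()
single_pass = conn.execute(single_pass_sql).fetchall()

# CTR in percent, computed in Python to sidestep SQLite's integer division.
# Every network in this sample has at least one impression, so no zero division.
ctr = {net: 100 * clk / imp for net, imp, clk in single_pass}
print(nested == single_pass, ctr)
```

Both forms agree on this sample. Which one runs faster on the real table depends on the data and the available indexes, so it is worth comparing them with EXPLAIN in MySQL.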