Is there a way to make an SQL NOT IN query faster?

I want to count the unique mobile numbers logged to a database on a given day that had never appeared in the log before. I thought it was a trivial query, but I was shocked when it took 10 minutes on a table with about 900K entries. The sample SELECT below counts the unique mobile numbers that were logged on the 9th of April 2015 and had never been logged before. It's like finding the truly new visitors to your site on a specific day.

SELECT COUNT(DISTINCT mobile_number)
FROM log_entries
WHERE created_at BETWEEN '2015-04-09 00:00:00'
    AND '2015-04-09 23:59:59'
    AND mobile_number NOT IN (
        SELECT mobile_number
        FROM log_entries
        WHERE created_at < '2015-04-09 00:00:00'
        )

I have individual indexes on created_at and on mobile_number.

Is there a way to make it faster? I see a very similar question here on SO but that was working with two tables.

A NOT IN can be rewritten as a NOT EXISTS query, which is very often faster (unfortunately the Postgres optimizer does not make this transformation itself; the two forms are only equivalent when the subquery returns no NULLs).

SELECT COUNT(DISTINCT l1.mobile_number) 
FROM log_entries as l1
WHERE l1.created_at >= '2015-04-09 00:00:00' 
  AND l1.created_at <= '2015-04-09 23:59:59' 
  AND NOT EXISTS (SELECT * 
                  FROM log_entries l2
                  WHERE l2.created_at < '2015-04-09 00:00:00'
                    AND l2.mobile_number = l1.mobile_number);

An index on (mobile_number, created_at) should further improve the performance.
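To illustrate how the two forms can differ once a NULL is involved, here is a minimal sketch using Python's built-in sqlite3 module. The rows are made up, including the hypothetical NULL mobile_number, and SQLite stands in for Postgres here on the assumption that both follow the standard NULL semantics for NOT IN.

```python
import sqlite3

# In-memory database with invented sample rows (not the asker's data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_entries (mobile_number TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO log_entries VALUES (?, ?)",
    [
        ("111", "2015-04-08 10:00:00"),  # seen before the 9th
        (None, "2015-04-08 11:00:00"),   # hypothetical row with a NULL number
        ("111", "2015-04-09 09:00:00"),  # returning visitor
        ("222", "2015-04-09 12:00:00"),  # genuinely new on the 9th
    ],
)

# NOT IN: the NULL in the subquery result makes the predicate unknown
# for every number that is not found, so no row qualifies.
not_in_count = conn.execute("""
    SELECT COUNT(DISTINCT mobile_number)
    FROM log_entries
    WHERE created_at >= '2015-04-09 00:00:00'
      AND created_at <  '2015-04-10 00:00:00'
      AND mobile_number NOT IN (
          SELECT mobile_number
          FROM log_entries
          WHERE created_at < '2015-04-09 00:00:00')
""").fetchone()[0]

# NOT EXISTS: the NULL row simply never matches, so '222' is counted.
not_exists_count = conn.execute("""
    SELECT COUNT(DISTINCT l1.mobile_number)
    FROM log_entries AS l1
    WHERE l1.created_at >= '2015-04-09 00:00:00'
      AND l1.created_at <  '2015-04-10 00:00:00'
      AND NOT EXISTS (SELECT 1
                      FROM log_entries l2
                      WHERE l2.created_at < '2015-04-09 00:00:00'
                        AND l2.mobile_number = l1.mobile_number)
""").fetchone()[0]

print(not_in_count, not_exists_count)  # 0 1
```

If mobile_number is declared NOT NULL, the two queries return the same result, and the choice is purely about the execution plan.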


A side note: created_at <= '2015-04-09 23:59:59' will not include rows with fractional seconds, e.g. 2015-04-09 23:59:59.789. When dealing with timestamps, it's better to use "less than" with the next day instead of "less than or equal" with the day in question.

So it is better to use created_at < '2015-04-10 00:00:00', which also catches rows on that day with fractional seconds.
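A small sketch of the difference, again with Python's sqlite3 module and invented rows. One assumption to note: SQLite stores these timestamps as text and compares them as strings, whereas Postgres compares real timestamp values. For this fixed-width format the ordering comes out the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_entries (mobile_number TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO log_entries VALUES (?, ?)",
    [
        ("111", "2015-04-09 08:00:00"),
        ("222", "2015-04-09 23:59:59.789"),  # last second of the day, with a fraction
        ("333", "2015-04-10 00:00:00"),      # already belongs to the next day
    ],
)

# Closed range: the row at 23:59:59.789 falls outside the upper bound.
closed_count = conn.execute("""
    SELECT COUNT(*)
    FROM log_entries
    WHERE created_at >= '2015-04-09 00:00:00'
      AND created_at <= '2015-04-09 23:59:59'
""").fetchone()[0]

# Half-open range: everything before midnight of the next day is included.
half_open_count = conn.execute("""
    SELECT COUNT(*)
    FROM log_entries
    WHERE created_at >= '2015-04-09 00:00:00'
      AND created_at <  '2015-04-10 00:00:00'
""").fetchone()[0]

print(closed_count, half_open_count)  # 1 2
```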

I tend to suggest transforming NOT IN into a left anti-join (i.e. a left join that only keeps the left rows that do not match the right side). It's somewhat complicated in this case by the fact that it's a self-join against two distinct ranges of the same table, so you're really joining two subqueries:

SELECT COUNT(n.mobile_number)
FROM (
  SELECT DISTINCT mobile_number
  FROM log_entries
  WHERE created_at BETWEEN '2015-04-09 00:00:00' AND '2015-04-09 23:59:59'
) n
LEFT OUTER JOIN (
  SELECT DISTINCT mobile_number
  FROM log_entries
  WHERE created_at < '2015-04-09 00:00:00'
) o ON (n.mobile_number = o.mobile_number)
WHERE o.mobile_number IS NULL;

I'd be interested in the performance of this as compared with the typical NOT EXISTS formulation provided by @a_horse_with_no_name.

Note that I've also pushed the DISTINCT check down into the subquery.
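Before comparing timings, it may be worth checking that the anti-join and the NOT EXISTS form agree on a small data set. The following is only a sketch with made-up rows in an in-memory SQLite database (via Python's sqlite3 module); it says nothing about how Postgres would plan either query, and it uses a half-open date range rather than BETWEEN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_entries (mobile_number TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO log_entries VALUES (?, ?)",
    [
        ("111", "2015-04-07 10:00:00"),  # old number ...
        ("111", "2015-04-09 09:00:00"),  # ... returning on the 9th
        ("222", "2015-04-09 10:00:00"),  # new on the 9th
        ("222", "2015-04-09 18:30:00"),  # same new number, logged twice
        ("333", "2015-04-09 12:00:00"),  # new on the 9th
        ("444", "2015-04-08 12:00:00"),  # only seen before the 9th
        ("555", "2015-04-10 12:00:00"),  # only seen after the 9th
    ],
)

# Left anti-join between the day's distinct numbers (n) and older ones (o).
anti_join_count = conn.execute("""
    SELECT COUNT(n.mobile_number)
    FROM (
      SELECT DISTINCT mobile_number
      FROM log_entries
      WHERE created_at >= '2015-04-09 00:00:00'
        AND created_at <  '2015-04-10 00:00:00'
    ) n
    LEFT OUTER JOIN (
      SELECT DISTINCT mobile_number
      FROM log_entries
      WHERE created_at < '2015-04-09 00:00:00'
    ) o ON (n.mobile_number = o.mobile_number)
    WHERE o.mobile_number IS NULL
""").fetchone()[0]

# The NOT EXISTS formulation over the same range.
not_exists_count = conn.execute("""
    SELECT COUNT(DISTINCT l1.mobile_number)
    FROM log_entries AS l1
    WHERE l1.created_at >= '2015-04-09 00:00:00'
      AND l1.created_at <  '2015-04-10 00:00:00'
      AND NOT EXISTS (SELECT 1
                      FROM log_entries l2
                      WHERE l2.created_at < '2015-04-09 00:00:00'
                        AND l2.mobile_number = l1.mobile_number)
""").fetchone()[0]

print(anti_join_count, not_exists_count)  # 2 2
```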

Your query seems to be "how many newly seen mobile numbers are there in <time range>". Right?

Isn't WHERE created_at >= '2015-04-09 00:00:00' AND created_at <= '2015-04-09 23:59:59' taking care of WHERE created_at < '2015-04-09 00:00:00'? Am I missing something here?

NOT IN isn't fast at all, and your subquery returns a lot of repeated rows. Maybe you should store the unique numbers in a dedicated table (because GROUP BY will be slow too).

Try something like this:

SELECT mobile_number, min(created_at)
FROM log_entries
GROUP BY mobile_number
HAVING min(created_at) between '2015-04-09 00:00:00' and '2015-04-09 23:59:59'

Adding a single index covering both mobile_number and created_at will improve performance slightly, assuming that there are other columns in the table, as only that index will need to be scanned.
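The GROUP BY query above returns one row per new number rather than a count, so to get the figure the question asks for it can be wrapped in an outer COUNT(*). Here is a rough sketch with Python's sqlite3 module and invented rows; the index name idx_log_mobile_created and the alias first_seen are hypothetical, and whether the index is actually used depends on the database and its planner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_entries (mobile_number TEXT, created_at TEXT)")
# Composite index covering both columns (hypothetical name).
conn.execute(
    "CREATE INDEX idx_log_mobile_created ON log_entries (mobile_number, created_at)"
)
conn.executemany(
    "INSERT INTO log_entries VALUES (?, ?)",
    [
        ("111", "2015-04-07 10:00:00"),  # first seen on the 7th
        ("111", "2015-04-09 09:00:00"),
        ("222", "2015-04-09 10:00:00"),  # first seen on the 9th
        ("222", "2015-04-09 18:30:00"),
        ("333", "2015-04-09 12:00:00"),  # first seen on the 9th
        ("444", "2015-04-08 12:00:00"),  # first seen on the 8th
        ("555", "2015-04-10 12:00:00"),  # first seen on the 10th
    ],
)

# A number is new on the 9th exactly when its earliest log entry falls on the 9th.
new_numbers = conn.execute("""
    SELECT COUNT(*)
    FROM (
      SELECT mobile_number
      FROM log_entries
      GROUP BY mobile_number
      HAVING MIN(created_at) >= '2015-04-09 00:00:00'
         AND MIN(created_at) <  '2015-04-10 00:00:00'
    ) AS first_seen
""").fetchone()[0]

print(new_numbers)  # 2
```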

Try using WITH (if your SQL dialect supports it). The Postgres documentation is here: http://www.postgresql.org/docs/current/static/queries-with.html

Your query would then look like this:

WITH b AS (
  SELECT DISTINCT mobile_number
  FROM log_entries
  WHERE created_at < '2015-04-09 00:00:00'
)
SELECT COUNT(DISTINCT a.mobile_number)
FROM log_entries a
LEFT JOIN b USING (mobile_number)
WHERE a.created_at >= '2015-04-09 00:00:00'
  AND a.created_at <= '2015-04-09 23:59:59'
  AND b.mobile_number IS NULL;
