简体   繁体   中英

MySQL : Dependent Sub Query with NOT IN in the WHERE clause is very slow

I am auditing user details from my application using open Id login .If a first time a user is login a OPEN ID we consider as signup . I am generating audit signin report using this details . Sample Table Data.

+---------+----------+-----------+---------------+
| USER_ID | PROVIDER | OPERATION | TIMESTAMP     |
+---------+----------+-----------+---------------+
|     120 | Google   | SIGN_UP   | 1347296347000 |
|     120 | Google   | SIGN_IN   | 1347296347000 |
|     121 | Yahoo    | SIGN_IN   | 1347296347000 |
|     122 | Yahoo    | SIGN_IN   | 1347296347000 |
|     120 | Google   | SIGN_UP   | 1347296347000 |
|     120 | FaceBook | SIGN_IN   | 1347296347000 |
+---------+----------+-----------+---------------+

In this table I want to exclude already SIGN_UP ed " SIGN_IN " ed user count based on provider .

Show Create table

CREATE TABLE `signin_details` (
  `USER_ID` int(11) DEFAULT NULL,
  `PROVIDER` char(40) DEFAULT NULL,
  `OPERATION` char(40) DEFAULT NULL,
  `TIMESTAMP` bigint(20) DEFAULT NULL
) ENGINE=InnoDB

I am using this query .

select 
  count(distinct(USER_ID)) as signin_count, 
  PROVIDER from signin_details s1 
where 
  s1.USER_ID NOT IN 
  (
    select 
      USER_ID 
    from signin_details 
    where 
      signin_details.PROVIDER=s1.PROVIDER 
      and signin_details.OPERATION='SIGN_UP' 
      and signin_details.TIMESTAMP/1000 BETWEEN UNIX_TIMESTAMP(CURRENT_DATE()-INTERVAL 1 DAY) * 1000 AND UNIX_TIMESTAMP(CURRENT_DATE()) * 1000
  )  
  AND OPERATION='SIGN_IN' group by PROVIDER;

Explain Output:

+----+--------------------+----------------+------+---------------+------+---------+------+------+-----------------------------+
| id | select_type        | table          | type | possible_keys | key  | key_len | ref  | rows | Extra                       |
+----+--------------------+----------------+------+---------------+------+---------+------+------+-----------------------------+
|  1 | PRIMARY            | s1             | ALL  | NULL          | NULL | NULL    | NULL |    6 | Using where; Using filesort |
|  2 | DEPENDENT SUBQUERY | signin_details | ALL  | NULL          | NULL | NULL    | NULL |    6 | Using where                 |
+----+--------------------+----------------+------+---------------+------+---------+------+------+-----------------------------+

Query Output :

+--------------+----------+
| signin_count | PROVIDER |
+--------------+----------+
|            1 | FaceBook |
|            2 | Yahoo    |
+--------------+----------+

It takes more than 40 minutes to execute for 200k rows.

My assumption is it will check each row with total number of dependant subquery output.

My Assumption on this query.

 A -> Dependant Outputs (B,C,D) .
 A check with B
 A check with C
 A check with D

If dependant query output is larger it will take so long time to execute. How to improve this query?

If you use MySQL you have to know that sub queries performs awful slow.

IN is slow...

EXISTS is often faster then IN

JOIN is mostly the fastest way do things like this.

SELECT DISTINCT
  s1.PROVIDER,
  COUNT(DISTINCT s1.USER_ID)

FROM 
  signin_details s1
  LEFT JOIN 
  (
    SELECT DISTINCT
      USER_ID, PROVIDER
    FROM 
      signin_details 
    WHERE
      signin_details.OPERATION='SIGN_UP' 
      AND 
        signin_details.TIMESTAMP 
          BETWEEN 
            UNIX_TIMESTAMP(CURRENT_DATE()-INTERVAL 1 DAY) * 1000 
            AND UNIX_TIMESTAMP(CURRENT_DATE()) * 1000
  ) AS t USING  (USER_ID, PROVIDER)

WHERE
  t.USER_ID IS NULL
  AND OPERATION='SIGN_IN'
GROUP BY s1.PROVIDER

http://sqlfiddle.com/#!2/122ac/12

NOTE: If you wonder about the sqlfiddle result consider here is a UNIX_TIMESTAMP in the query.

Result:

| PROVIDER | COUNT(DISTINCT S1.USER_ID) |
-----------------------------------------
| FaceBook |                          1 |
|    Yahoo |                          2 |

MySQL and the INTERSECT story. You get all combinations of USER_ID and PROVIDER which you don't want to count. Then LEFT JOIN them to your data. Now all the rows you want to count have no values from the LEFT JOIN . You get them by t.USER_ID IS NULL .


Input:

| rn° | USER_ID | PROVIDER | OPERATION |     TIMESTAMP |
-------------------------------------------------------
| 1   |     120 |   Google |   SIGN_UP | 1347296347000 | -
| 2   |     120 |   Google |   SIGN_IN | 1347296347000 | - (see rn° 1)
| 3   |     121 |    Yahoo |   SIGN_IN | 1347296347000 | Y
| 4   |     122 |    Yahoo |   SIGN_IN | 1347296347000 | Y
| 5   |     120 |   Google |   SIGN_UP | 1347296347000 | -
| 6   |     120 | FaceBook |   SIGN_IN | 1347296347000 | F
| 7   |     119 | FaceBook |   SIGN_IN | 1347296347000 | - (see rn° 8)
| 8   |     119 | FaceBook |   SIGN_UP | 1347296347000 | -

Use "NOT IN" inside the HAVING clause. it will be faster than "where not in"

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM