MySQL: Dependent subquery with NOT IN in the WHERE clause is very slow
I am auditing user details from my application, which uses OpenID login. The first time a user logs in with an OpenID provider, we treat it as a sign-up. I generate an audit sign-in report from these details.
Sample table data:
+---------+----------+-----------+---------------+
| USER_ID | PROVIDER | OPERATION | TIMESTAMP |
+---------+----------+-----------+---------------+
| 120 | Google | SIGN_UP | 1347296347000 |
| 120 | Google | SIGN_IN | 1347296347000 |
| 121 | Yahoo | SIGN_IN | 1347296347000 |
| 122 | Yahoo | SIGN_IN | 1347296347000 |
| 120 | Google | SIGN_UP | 1347296347000 |
| 120 | FaceBook | SIGN_IN | 1347296347000 |
+---------+----------+-----------+---------------+
From this table I want a count of SIGN_IN users per provider, excluding users who have already signed up (SIGN_UP) with that provider.
SHOW CREATE TABLE output:
CREATE TABLE `signin_details` (
`USER_ID` int(11) DEFAULT NULL,
`PROVIDER` char(40) DEFAULT NULL,
`OPERATION` char(40) DEFAULT NULL,
`TIMESTAMP` bigint(20) DEFAULT NULL
) ENGINE=InnoDB
I am using this query:
select
    count(distinct(USER_ID)) as signin_count,
    PROVIDER
from signin_details s1
where
    s1.USER_ID NOT IN
    (
        select
            USER_ID
        from signin_details
        where
            signin_details.PROVIDER=s1.PROVIDER
            and signin_details.OPERATION='SIGN_UP'
            and signin_details.TIMESTAMP/1000 BETWEEN UNIX_TIMESTAMP(CURRENT_DATE()-INTERVAL 1 DAY) * 1000 AND UNIX_TIMESTAMP(CURRENT_DATE()) * 1000
    )
    AND OPERATION='SIGN_IN'
group by PROVIDER;
EXPLAIN output:
+----+--------------------+----------------+------+---------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------------+------+---------------+------+---------+------+------+-----------------------------+
| 1 | PRIMARY | s1 | ALL | NULL | NULL | NULL | NULL | 6 | Using where; Using filesort |
| 2 | DEPENDENT SUBQUERY | signin_details | ALL | NULL | NULL | NULL | NULL | 6 | Using where |
+----+--------------------+----------------+------+---------------+------+---------+------+------+-----------------------------+
Query output:
+--------------+----------+
| signin_count | PROVIDER |
+--------------+----------+
| 1 | FaceBook |
| 2 | Yahoo |
+--------------+----------+
It takes more than 40 minutes to execute on 200k rows.
My assumption is that each outer row is checked against every row returned by the dependent subquery:
A -> dependent subquery outputs (B, C, D)
A check with B
A check with C
A check with D
If the dependent subquery returns many rows, the query takes a very long time to execute. How can I improve this query?
If you use MySQL, you should know that subqueries, especially dependent ones like this, can perform very slowly.
IN is slow.
EXISTS is often faster than IN.
JOIN is usually the fastest way to do things like this.
SELECT DISTINCT
    s1.PROVIDER,
    COUNT(DISTINCT s1.USER_ID)
FROM
    signin_details s1
LEFT JOIN
    (
        -- (USER_ID, PROVIDER) pairs that must not be counted
        SELECT DISTINCT
            USER_ID, PROVIDER
        FROM
            signin_details
        WHERE
            signin_details.OPERATION='SIGN_UP'
            AND signin_details.TIMESTAMP
                BETWEEN UNIX_TIMESTAMP(CURRENT_DATE()-INTERVAL 1 DAY) * 1000
                    AND UNIX_TIMESTAMP(CURRENT_DATE()) * 1000
    ) AS t USING (USER_ID, PROVIDER)
WHERE
    t.USER_ID IS NULL  -- keep only rows with no match in t
    AND OPERATION='SIGN_IN'
GROUP BY s1.PROVIDER
http://sqlfiddle.com/#!2/122ac/12
NOTE: If the sqlfiddle result looks surprising, keep in mind that the query contains UNIX_TIMESTAMP, so the result depends on the current date.
Result:
| PROVIDER | COUNT(DISTINCT S1.USER_ID) |
-----------------------------------------
| FaceBook | 1 |
| Yahoo | 2 |
MySQL and the INTERSECT story.
You first get all combinations of USER_ID and PROVIDER that you don't want to count. Then you LEFT JOIN them to your data. Now all the rows you want to count have no values from the LEFT JOIN. You select them with t.USER_ID IS NULL.
Input:
| rn° | USER_ID | PROVIDER | OPERATION | TIMESTAMP |
-------------------------------------------------------
| 1 | 120 | Google | SIGN_UP | 1347296347000 | -
| 2 | 120 | Google | SIGN_IN | 1347296347000 | - (see rn° 1)
| 3 | 121 | Yahoo | SIGN_IN | 1347296347000 | Y
| 4 | 122 | Yahoo | SIGN_IN | 1347296347000 | Y
| 5 | 120 | Google | SIGN_UP | 1347296347000 | -
| 6 | 120 | FaceBook | SIGN_IN | 1347296347000 | F
| 7 | 119 | FaceBook | SIGN_IN | 1347296347000 | - (see rn° 8)
| 8 | 119 | FaceBook | SIGN_UP | 1347296347000 | -
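The remark above that EXISTS is often faster than IN can be sketched as a NOT EXISTS rewrite of the original query. This is only a sketch under assumptions: the index name idx_signup_lookup and the alias s2 are hypothetical, and the index is suggested because the EXPLAIN output in the question shows full table scans with no usable key. Whether it beats the LEFT JOIN version depends on the MySQL version and the data, so compare both with EXPLAIN.

```sql
-- Hypothetical index (the name is an assumption, not from the original post).
-- It is meant to let the correlated lookup below be answered from the index
-- instead of scanning the whole table for every outer row.
CREATE INDEX idx_signup_lookup
    ON signin_details (OPERATION, PROVIDER, USER_ID, `TIMESTAMP`);

SELECT
    s1.PROVIDER,
    COUNT(DISTINCT s1.USER_ID) AS signin_count
FROM signin_details s1
WHERE s1.OPERATION = 'SIGN_IN'
  AND NOT EXISTS (
        -- a SIGN_UP row for the same user and provider within the last day
        SELECT 1
        FROM signin_details s2
        WHERE s2.USER_ID   = s1.USER_ID
          AND s2.PROVIDER  = s1.PROVIDER
          AND s2.OPERATION = 'SIGN_UP'
          AND s2.`TIMESTAMP`
              BETWEEN UNIX_TIMESTAMP(CURRENT_DATE() - INTERVAL 1 DAY) * 1000
                  AND UNIX_TIMESTAMP(CURRENT_DATE()) * 1000
      )
GROUP BY s1.PROVIDER;
```

One difference to keep in mind: USER_ID is nullable in this table. If the subquery of a NOT IN returns a NULL, NOT IN filters out every outer row for that provider, whereas NOT EXISTS does not, so the two forms are only equivalent when USER_ID is never NULL.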
Use "NOT IN" inside the HAVING clause. It will be faster than "WHERE ... NOT IN".
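The suggestion above comes without a query, so the following is one possible reading of it, not necessarily what was intended. The idea is to group the SIGN_IN rows by provider and user first, so that the NOT IN test runs once per (PROVIDER, USER_ID) group rather than once per row. The aliases s1, s2 and x are hypothetical.

```sql
SELECT
    COUNT(*) AS signin_count,
    x.PROVIDER
FROM (
    -- one row per (PROVIDER, USER_ID) among the SIGN_IN rows
    SELECT s1.PROVIDER, s1.USER_ID
    FROM signin_details s1
    WHERE s1.OPERATION = 'SIGN_IN'
    GROUP BY s1.PROVIDER, s1.USER_ID
    -- drop groups whose user signed up with that provider in the last day
    HAVING s1.USER_ID NOT IN (
        SELECT s2.USER_ID
        FROM signin_details s2
        WHERE s2.PROVIDER  = s1.PROVIDER
          AND s2.OPERATION = 'SIGN_UP'
          AND s2.`TIMESTAMP`
              BETWEEN UNIX_TIMESTAMP(CURRENT_DATE() - INTERVAL 1 DAY) * 1000
                  AND UNIX_TIMESTAMP(CURRENT_DATE()) * 1000
    )
) AS x
GROUP BY x.PROVIDER;
```

The subquery is still correlated, so any gain comes only from evaluating it fewer times. It is worth checking the plan with EXPLAIN before assuming this is faster than the join.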