简体   繁体   中英

MySQL NOT IN (array[]) vs PHP in_array(array[])?

I have a query that I run hourly and I am processing a certain dataset from this query. While processing this dataset, I need to ignore some IDs, I am currently doing this with NOT IN , but the number of IDs I need to ignore is around 50.

The question I am wondering is, I am creating a text file in a certain pattern with the data I am processing, should I use this ignore operation directly in the query or inside the foreach pattern for better performance?

Query returns around 5000-7000 data in a dataset consists of 10M record, and I need to ignore around 50 ID from resultset.

Lets say;

$blacklist_arr = array(1,10,20,30,40,50,60,70,80,90,100); //around 50 element in array~

What I use now;

...QUERY...
resultSet.ID NOT IN (\'' . implode( "', '" , $blacklist_arr ) . '\')

What I'm planning to use;

foreach ($final_dataset as $final_data) {
    ...
    if (!in_array($final_data, $blacklist_arr )) {
    //write to file
    ...

edit * Query structure is below;

SELECT * 
FROM
    (
        (
        SELECT DISTINCT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a`
            INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 ) 
        WHERE
            ( a.someColumn > NOW( ) - INTERVAL 2 HOUR ) 
            AND ( a.col3 < DATE_HERE ) 
        ) UNION
        (
        SELECT  a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a` 
        WHERE
            ( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL ) 
            AND ( a.col3 < DATE_HERE ) 
        ) 
    ) AS resultSet 
WHERE
    resultSet.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 ) 
ORDER BY
    resultSet.col3 ASC,
    resultSet.col2 ASC,
    resultSet.col4 ASC,
    resultSet.col1 DESC

If your t.col_black_elem is obtained by another query you could try using a left join an check for not matching value

   SELECT a.col1,..., a.coln
   from table1 a 
   LEFT JOIN (
        select col_black_elem from tablex 
   ) t on t.col_black_elem = a.colx 
   WHERE  t.col_black_elem  is null

and for youe code

SELECT * 
FROM
    (
        (
        SELECT DISTINCT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a`
            INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 ) 
        WHERE
            ( a.someColumn > NOW( ) - INTERVAL 2 HOUR ) 
            AND ( a.col3 < DATE_HERE ) 
        ) UNION
        (
        SELECT  a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a` 
        WHERE
            ( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL ) 
            AND ( a.col3 < DATE_HERE ) 
        ) 
    ) AS resultSet 
LEFT JOIN (
        select col_black_elem from tablex 
   ) t on t.col_black_elem =  resultSet.col1
WHERE  t.col_black_elem  is null
ORDER BY
    resultSet.col3 ASC,
    resultSet.col2 ASC,
    resultSet.col4 ASC,
    resultSet.col1 DESC

Otherwise if your t.col_black_elem is not obtained by a nother query you could popolate a temp table ora buld dinamycally a temp table using several select union

From the performance looking point I recomment you:

  1. Remove DISTINCT in 1st subquery. One sorting is better then two sortings.
  2. Filter your rows in subqueries, not in combined rowset, this will decrease the amount of rows to be sorted by UNION.
SELECT * 
FROM
    (
        (
        SELECT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a`
            INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 ) 
        WHERE
            ( a.someColumn > NOW( ) - INTERVAL 2 HOUR ) 
            AND ( a.col3 < DATE_HERE ) 
            AND a.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 ) 

        ) UNION
        (
        SELECT  a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM
            `a` 
        WHERE
            ( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL ) 
            AND ( a.col3 < DATE_HERE ) 
            AND a.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 ) 

        ) 
    ) AS resultSet 
ORDER BY
    resultSet.col3 ASC,
    resultSet.col2 ASC,
    resultSet.col4 ASC,
    resultSet.col1 DESC

A variety of points:

  • I have a "Rule of Thumb": "If a possible optimization is estimated to improve things by less than 10%, move on. That is, don't spend extra effort on it. Instead, look for something better to work on." According to your numbers, the optimization decreases the result set by only about 1%.

  • There is a standard programming rule: "KISS". Which is simpler to code -- the NOT IN or the PHP filtering? A variant: "Which approach is fewer keystrokes?" That comes from "A Programmer's time is much more valuable than computer time.

  • Moving the NOT IN into each subquery may speed it up slightly . This is because it would decrease (slightly) the intermediate tables involved in the query. (However, this fails the 10% and KISS rules.) On the other hand, it could eliminate the outermost Select. Note: This works: (SELECT...) UNION (SELECT...) ORDER BY... .

  • Potential bug: The innermost Select may be picking a date & time from one of the excluded col1's.

  • UNION defaults to UNION DISTINCT , which is slower than UNION ALL . Consider this as a bigger optimization.

  • ON ( sub.X = a.col4) probably needs to mention col2 .

  • Is DATE_HERE somehow related to NOW() ? Perhaps you need TIMESTAMP instead of DATETIME or vice versa?

  • I suspect that the DISTINCT is not needed. Anyway, it is redundant with the UNION .

  • Consider whether the "blacklist" should be a table, not a config file. As a table, NOT EXISTS(..) or LEFT JOIN.. IS NOT NULL would need to be added to the query. This would be slower than what you have now but might be "cleaner".

  • WHERE 1=1 is an artifact of lazy programming; it is not an optimization; the Optimizer will simply toss it.

  • Often, better indexes provide the most improvement. Maybe the following would help. Note: Separate, single-column indexes are not as good. Also, when adding INDEX(a,b) , drop INDEX(a) .

     a (as b): INDEX(col2, col4) -- this order a: INDEX(col4, col3, someColumn) -- col4 first

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM