Can this query or table schema be optimized?

Question

I am running this procedure a few million times. Each call takes only a few ms, but running all of them ends up taking a couple of weeks. I was wondering if anyone could help me optimize it or improve its performance. Any improvement might save days!

CREATE PROCEDURE process_parameters(IN parameter1 VARCHAR(128), IN parameter2 VARCHAR(128), IN combination_type CHAR(1))
BEGIN

        SET @parameter1_id := NULL, @parameter2_id := NULL;
        SET @parameter1_hash := "", @parameter2_hash := "";

        IF parameter1 IS NOT NULL THEN

                SET @parameter1_hash := parameter1;
                INSERT IGNORE INTO `collection1` (`parameter`) VALUES (parameter1);
                SET @parameter1_id := (SELECT `id` FROM `collection1` WHERE `parameter` = parameter1);

        END IF;

        IF parameter2 IS NOT NULL THEN

                SET @parameter2_hash := parameter2;
                INSERT IGNORE INTO `collection2` (`parameter`) VALUES (parameter2);
                SET @parameter2_id := (SELECT `id` FROM `collection2` WHERE `parameter` = parameter2);

        END IF;

        SET @hash := MD5(CONCAT(@parameter1_hash, @parameter2_hash));
        INSERT IGNORE INTO `combinations` (`hash`,`type`,`parameter1`,`parameter2`) VALUES (@hash, combination_type, @parameter1_id, @parameter2_id);

END

The logic behind it is: I store unique combinations of (parameter1, parameter2) in combinations, where parameter1 or parameter2 can be NULL (but never both at the same time). I store a type in combinations so that I know later which parameter has a value. To ensure that a combination is unique I added an MD5 field (a primary key on (parameter1, parameter2) will not work because a comparison with NULL always returns NULL). Each parameter has a separate table (collection1 and collection2 respectively) that stores its unique id. There are hundreds or thousands of unique parameter1 and parameter2 values, but their combinations are highly repeated and far fewer than the full Cartesian product.
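The NULL behaviour described above can be checked with a small experiment. This is only a sketch: `null_unique_demo` is a hypothetical scratch table, not part of the schema in this question, and InnoDB is assumed.

```sql
-- Hypothetical scratch table, used only to illustrate NULL handling in unique keys.
CREATE TABLE null_unique_demo (
    parameter1 INT NULL,
    parameter2 INT NULL,
    UNIQUE KEY uq_params (parameter1, parameter2)
) ENGINE=InnoDB;

-- Both rows are accepted: the unique index does not treat two NULLs as equal.
INSERT INTO null_unique_demo (parameter1, parameter2) VALUES (1, NULL);
INSERT INTO null_unique_demo (parameter1, parameter2) VALUES (1, NULL);

SELECT COUNT(*) AS row_count FROM null_unique_demo;

-- Plain equality against NULL yields NULL; the NULL-safe operator <=> yields 1.
SELECT NULL = NULL AS plain_eq, NULL <=> NULL AS null_safe_eq;

DROP TABLE null_unique_demo;
```

A PRIMARY KEY would not help either, since primary key columns must be NOT NULL.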

As an example, ("A", "1"), ("A", "2"), ("B", "1"), ("A", "1"), ("A", NULL), (NULL, "2") would yield:

`collection1` (`id`, `parameter`)
1, "A"
2, "B"

`collection2` (`id`, `parameter`)
1, "1"
2, "2"

`combinations` (`type`, `parameter1`, `parameter2`)
"P1andP2", 1, 1
"P1andP2", 1, 2
"P1andP2", 2, 1
"P1Only",  1, NULL
"P2Only",  NULL, 2

These are the definitions of the tables:

DESCRIBE `combinations`;
+-------------+-----------------------------------+------+-----+---------+----------------+
| Field       | Type                              | Null | Key | Default | Extra          |
+-------------+-----------------------------------+------+-----+---------+----------------+
| combination | int(11)                           | NO   | PRI | NULL    | auto_increment |
| hash        | char(32)                          | NO   | UNI | NULL    |                |
| type        | enum('P1andP2','P1Only','P2Only') | NO   |     | NULL    |                |
| parameter1  | int(11)                           | YES  |     | NULL    |                |
| parameter2  | int(11)                           | YES  |     | NULL    |                |
+-------------+-----------------------------------+------+-----+---------+----------------+

DESCRIBE `collection1`; (`collection2` is identical)
+-----------+--------------+------+-----+---------+----------------+
| Field     | Type         | Null | Key | Default | Extra          |
+-----------+--------------+------+-----+---------+----------------+
| id        | int(11)      | NO   | PRI | NULL    | auto_increment |
| parameter | varchar(255) | NO   | UNI | NULL    |                |
+-----------+--------------+------+-----+---------+----------------+

Any help will be appreciated!

Answer 1

Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.

Use LAST_INSERT_ID()

 SET @parameter1_id := (SELECT `id` FROM `collection1`
                          WHERE `parameter` = parameter1);

can be replaced by

 SELECT @parameter1_id := LAST_INSERT_ID();

It will avoid a round trip to the server.

Oops... The OP points out that the id won't be returned if the row is a dup. This is a workaround that might run faster:

INSERT INTO `collection1` (`parameter`)
        VALUES (parameter1)
    ON DUPLICATE KEY UPDATE
        id = LAST_INSERT_ID(id);
SELECT @parameter1_id := LAST_INSERT_ID();

It's a kludgy trick, but it is documented in the MySQL manual (see the sections on INSERT ... ON DUPLICATE KEY UPDATE and LAST_INSERT_ID()). But more below...
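To see the workaround behave as intended, here is a minimal sketch that inserts the same value twice and compares the ids. It assumes the `collection1` table from the question; `@first_id` and `@second_id` are illustrative names.

```sql
-- First call: the row is new, so LAST_INSERT_ID() is the generated id.
INSERT INTO `collection1` (`parameter`)
        VALUES ('A')
    ON DUPLICATE KEY UPDATE
        id = LAST_INSERT_ID(id);
SET @first_id := LAST_INSERT_ID();

-- Second call: the row is a duplicate, so the UPDATE branch runs and
-- LAST_INSERT_ID(id) stores the existing id as the session's last insert id.
INSERT INTO `collection1` (`parameter`)
        VALUES ('A')
    ON DUPLICATE KEY UPDATE
        id = LAST_INSERT_ID(id);
SET @second_id := LAST_INSERT_ID();

-- Expected to be 1 (both calls resolved to the same row).
SELECT @first_id = @second_id AS same_id;
```

Using SET instead of SELECT for the assignment avoids sending a result set back to the caller when this runs inside a stored procedure.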

Shrink table

Do you really need combination? You have another UNIQUE key (hash) that could be used as the PRIMARY KEY. This might cut in half the time taken for the final INSERT, since there would be one index to maintain instead of two.
This may (or may not) speed things up, but only because the row size shrinks: instead of storing the MD5 in CHAR(32), store UNHEX(MD5(...)) in BINARY(16).
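Both suggestions together might look like the following sketch. The table name `combinations_slim` is hypothetical, and whether anything else references the old `combination` id is something only the OP can check.

```sql
-- Hypothetical slimmer variant of `combinations`: the 16-byte binary hash
-- is the PRIMARY KEY and the auto_increment column is gone.
CREATE TABLE combinations_slim (
    hash       BINARY(16) NOT NULL,
    type       ENUM('P1andP2','P1Only','P2Only') NOT NULL,
    parameter1 INT NULL,
    parameter2 INT NULL,
    PRIMARY KEY (hash)
) ENGINE=InnoDB;

-- UNHEX() packs the 32 hex characters from MD5() into 16 raw bytes.
INSERT IGNORE INTO combinations_slim (hash, type, parameter1, parameter2)
    VALUES (UNHEX(MD5(CONCAT('A', '1'))), 'P1andP2', 1, 1);

-- HEX() turns the stored bytes back into readable form when needed.
SELECT HEX(hash) AS hash_hex, type, parameter1, parameter2
    FROM combinations_slim;
```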

Batch INSERT

Can you gather a bunch of these to INSERT at once? If you gather 1000 rows and string them into a single INSERT (actually 3 INSERTs, since 3 tables are involved), it will run roughly 10 times as fast.

Because of needing the ids, it gets more complicated. You would need to batch things into collection1 and collection2; then work on combinations.

Since the "collection*" tables are essentially "normalization", see my discussion of how to batch them very efficiently (http://mysql.rjweb.org/doc.php/staging_table#normalization). It involves 2 statements, one to insert new rows, the other to grab all the ids for the batch.
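A rough sketch of that batching flow follows. It is written for this question's tables and is not copied from the linked page. The `staging` table, its column names, and deriving `type` from which parameter is NULL are all assumptions, as is `staging` sharing the character set and collation of the collection tables.

```sql
-- Hypothetical staging table holding one batch of raw parameter pairs.
CREATE TABLE staging (
    parameter1    VARCHAR(128) NULL,
    parameter2    VARCHAR(128) NULL,
    parameter1_id INT NULL,
    parameter2_id INT NULL
) ENGINE=InnoDB;

-- Load the batch with one multi-row INSERT (here: the example from the question).
INSERT INTO staging (parameter1, parameter2) VALUES
    ('A', '1'), ('A', '2'), ('B', '1'), ('A', '1'), ('A', NULL), (NULL, '2');

-- Statement 1 (per collection): add only the values that are not there yet.
-- The LEFT JOIN ... IS NULL form avoids burning ids on duplicates.
INSERT INTO collection1 (parameter)
    SELECT DISTINCT s.parameter1
        FROM staging AS s
        LEFT JOIN collection1 AS c ON c.parameter = s.parameter1
        WHERE c.id IS NULL AND s.parameter1 IS NOT NULL;
INSERT INTO collection2 (parameter)
    SELECT DISTINCT s.parameter2
        FROM staging AS s
        LEFT JOIN collection2 AS c ON c.parameter = s.parameter2
        WHERE c.id IS NULL AND s.parameter2 IS NOT NULL;

-- Statement 2 (per collection): pull the ids back for the whole batch.
UPDATE staging AS s JOIN collection1 AS c ON c.parameter = s.parameter1
    SET s.parameter1_id = c.id;
UPDATE staging AS s JOIN collection2 AS c ON c.parameter = s.parameter2
    SET s.parameter2_id = c.id;

-- Finally, one INSERT for all combinations in the batch.
INSERT IGNORE INTO combinations (hash, type, parameter1, parameter2)
    SELECT DISTINCT
           MD5(CONCAT(COALESCE(parameter1, ''), COALESCE(parameter2, ''))),
           CASE WHEN parameter2 IS NULL THEN 'P1Only'
                WHEN parameter1 IS NULL THEN 'P2Only'
                ELSE 'P1andP2' END,
           parameter1_id, parameter2_id
        FROM staging;

-- Empty the staging table before loading the next batch.
TRUNCATE TABLE staging;
```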

COALESCE

Get rid of @parameter*_hash and @hash completely. Replace the use of @hash with the expression itself:

INSERT IGNORE INTO combinations (...) VALUES
    ( MD5(CONCAT(COALESCE(parameter1, ''), COALESCE(parameter2, ''))),
     ...)

Think of it this way... Each statement takes a non-trivial amount of time. (This shows up significantly in batching of inserts.) I'm getting rid of 4 statements at some expense due to adding complexity to one statement.
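Putting the suggestions so far into the OP's procedure could look something like this. It is a sketch only: the name `process_parameters_v2` and the local variables `p1_id` and `p2_id` are made up here, and the DELIMITER lines assume the mysql command-line client.

```sql
DELIMITER //

CREATE PROCEDURE process_parameters_v2(
    IN parameter1 VARCHAR(128),
    IN parameter2 VARCHAR(128),
    IN combination_type CHAR(1))
BEGIN
    -- Local variables instead of user-defined @variables; both start as NULL.
    DECLARE p1_id INT DEFAULT NULL;
    DECLARE p2_id INT DEFAULT NULL;

    IF parameter1 IS NOT NULL THEN
        -- Insert-or-find in one statement, then read the id without a SELECT.
        INSERT INTO `collection1` (`parameter`) VALUES (parameter1)
            ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
        SET p1_id = LAST_INSERT_ID();
    END IF;

    IF parameter2 IS NOT NULL THEN
        INSERT INTO `collection2` (`parameter`) VALUES (parameter2)
            ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
        SET p2_id = LAST_INSERT_ID();
    END IF;

    -- The hash is computed inline, so no separate SET statements are needed.
    INSERT IGNORE INTO `combinations` (`hash`, `type`, `parameter1`, `parameter2`)
        VALUES (MD5(CONCAT(COALESCE(parameter1, ''), COALESCE(parameter2, ''))),
                combination_type, p1_id, p2_id);
END //

DELIMITER ;
```

It produces the same hash values as the original, so existing rows in combinations stay valid.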

Settings

The most important might be innodb_flush_log_at_trx_commit = 2.
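For reference, the setting can be inspected and changed at runtime as sketched below (this needs a privileged account). The trade-off is durability: with the value 2 the log is written at each commit but flushed to disk only about once per second, so an OS crash or power failure can lose up to roughly a second of transactions.

```sql
-- Check the current value (the default is 1: flush to disk at every commit).
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

-- Relax it for the duration of the bulk load.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- Optionally restore full durability afterwards.
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
```

To keep the setting across restarts it would go into the server's configuration file instead.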

3 Streams

Write 3 procedures, each one with the code simplified for its particular type. Combining this with batching should further speed things up.
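As an illustration, the stream for rows that only have parameter1 might reduce to the sketch below. The procedure name `process_p1_only` is made up, and the hash is kept identical to what the original procedure would compute for such rows.

```sql
DELIMITER //

-- Hypothetical single-purpose variant: no IF branches, no COALESCE.
CREATE PROCEDURE process_p1_only(IN parameter1 VARCHAR(128))
BEGIN
    INSERT INTO `collection1` (`parameter`) VALUES (parameter1)
        ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

    -- parameter2 is always NULL in this stream, so the type is a constant.
    INSERT IGNORE INTO `combinations` (`hash`, `type`, `parameter1`, `parameter2`)
        VALUES (MD5(parameter1), 'P1Only', LAST_INSERT_ID(), NULL);
END //

DELIMITER ;
```

The P2Only and P1andP2 streams would follow the same pattern.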

Potential issues

I think these two will get the same hash. Hence, there will be only one row for these two:
```
 ("xyz", NULL) (NULL, "xyz")
```
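The collision is easy to confirm, and one possible way around it is to put a separator between the two parts before hashing. This is a sketch; the NUL separator is my assumption and only works if it can never occur inside a parameter value.

```sql
-- Same input to MD5 either way, so this returns 1.
SELECT MD5(CONCAT('xyz', '')) = MD5(CONCAT('', 'xyz')) AS same_hash;

-- With a separator the two inputs differ, so this returns 0.
SELECT MD5(CONCAT('xyz', '\0', '')) = MD5(CONCAT('', '\0', 'xyz'))
    AS same_hash_with_separator;
```

The same separator also keeps pairs like ("ab", "c") and ("a", "bc") apart. Changing the hash expression would mean rebuilding the hash values of existing rows.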
Be aware that INSERT IGNORE will burn ids if there is already a row with the given unique key. Because of this, keep an eye on running out of values with INT (only 2 billion). Changing to INT UNSIGNED would up it to 4B, still in 4 bytes.
