I am running this procedure a few million times. Each call takes only a few ms, but running all of them adds up to a couple of weeks. Could anyone help me optimize it or improve its performance? Any improvement might save days!
CREATE PROCEDURE process_parameters(IN parameter1 VARCHAR(128), IN parameter2 VARCHAR(128), IN combination_type CHAR(1))
BEGIN
    SET @parameter1_id := NULL, @parameter2_id := NULL;
    SET @parameter1_hash := "", @parameter2_hash := "";
    -- Insert parameter1 if it is new, then fetch its id
    IF parameter1 IS NOT NULL THEN
        SET @parameter1_hash := parameter1;
        INSERT IGNORE INTO `collection1` (`parameter`) VALUES (parameter1);
        SET @parameter1_id := (SELECT `id` FROM `collection1` WHERE `parameter` = parameter1);
    END IF;
    -- Same for parameter2
    IF parameter2 IS NOT NULL THEN
        SET @parameter2_hash := parameter2;
        INSERT IGNORE INTO `collection2` (`parameter`) VALUES (parameter2);
        SET @parameter2_id := (SELECT `id` FROM `collection2` WHERE `parameter` = parameter2);
    END IF;
    -- The MD5 of both values enforces uniqueness of the combination
    SET @hash := MD5(CONCAT(@parameter1_hash, @parameter2_hash));
    INSERT IGNORE INTO `combinations` (`hash`,`type`,`parameter1`,`parameter2`) VALUES (@hash, combination_type, @parameter1_id, @parameter2_id);
END
The logic behind it is this: I store unique combinations of (parameter1, parameter2) in combinations, where parameter1 or parameter2 can be NULL (but never both at the same time). I store a type in combinations to know later which parameter has a value. To ensure that a combination is unique I added an MD5 field (a primary key on (parameter1, parameter2) will not work, because a comparison with NULL always returns NULL). Each parameter has a separate table (collection1 and collection2 respectively) to store its unique id. There are hundreds or thousands of unique parameter1 and parameter2 values, but their combinations are highly repeated, so the number of distinct combinations is far below the full Cartesian product.
As an example, ("A", "1"), ("A", "2"), ("B", "1"), ("A", "1"), ("A", NULL), (NULL, "2") would yield:
`collection1` (`id`, `parameter`)
1, "A"
2, "B"
`collection2` (`id`, `parameter`)
1, "1"
2, "2"
`combinations` (`type`, `parameter1`, `parameter2`)
"P1andP2", 1, 1
"P1andP2", 1, 2
"P1andP2", 2, 1
"P1Only", 1, NULL
"P2Only", NULL, 2
These are the definitions of the tables:
DESCRIBE `combinations`;
+-------------+-----------------------------------+------+-----+---------+----------------+
| Field       | Type                              | Null | Key | Default | Extra          |
+-------------+-----------------------------------+------+-----+---------+----------------+
| combination | int(11)                           | NO   | PRI | NULL    | auto_increment |
| hash        | char(32)                          | NO   | UNI | NULL    |                |
| type        | enum('P1andP2','P1Only','P2Only') | NO   |     | NULL    |                |
| parameter1  | int(11)                           | YES  |     | NULL    |                |
| parameter2  | int(11)                           | YES  |     | NULL    |                |
+-------------+-----------------------------------+------+-----+---------+----------------+
DESCRIBE `collection1`; (`collection2` is identical)
+-----------+--------------+------+-----+---------+----------------+
| Field     | Type         | Null | Key | Default | Extra          |
+-----------+--------------+------+-----+---------+----------------+
| id        | int(11)      | NO   | PRI | NULL    | auto_increment |
| parameter | varchar(255) | NO   | UNI | NULL    |                |
+-----------+--------------+------+-----+---------+----------------+
Any help will be appreciated!
Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.
Use LAST_INSERT_ID()
SET @parameter1_id := (SELECT `id` FROM `collection1`
WHERE `parameter` = parameter1);
can be replaced by
SELECT @parameter1_id := LAST_INSERT_ID();
It will avoid a round trip to the server.
Oops... The OP points out that the id won't be returned if the row is a dup. This is a workaround that might run faster:
INSERT INTO `collection1` (`parameter`)
    VALUES (parameter1)
    ON DUPLICATE KEY UPDATE
        id = LAST_INSERT_ID(id);
SELECT @parameter1_id := LAST_INSERT_ID();
It's a kludgy trick, but it is documented in the MySQL manual under INSERT ... ON DUPLICATE KEY UPDATE. But more below...
Shrink table
Do you really need combination? You have another UNIQUE key (hash) that could be used as the PRIMARY KEY. This might cut in half the time taken for the final INSERT.
This may (or may not) speed things up, but only because the row size shrinks: instead of storing the MD5 in CHAR(32), store UNHEX(MD5(...)) in BINARY(16).
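As a rough sketch of what the slimmer table could look like, assuming the surrogate combination id is not referenced anywhere else. The table name combinations_slim is made up for illustration; the column types otherwise follow the DESCRIBE output in the question.

```sql
-- Hypothetical replacement for `combinations`: the 16-byte binary hash
-- is the PRIMARY KEY, so there is no separate auto_increment column
-- and no secondary UNIQUE index to maintain on every INSERT.
CREATE TABLE combinations_slim (
    hash       BINARY(16) NOT NULL,
    type       ENUM('P1andP2','P1Only','P2Only') NOT NULL,
    parameter1 INT NULL,
    parameter2 INT NULL,
    PRIMARY KEY (hash)
) ENGINE=InnoDB;

-- UNHEX() turns the 32 hex characters from MD5() into 16 raw bytes.
INSERT IGNORE INTO combinations_slim (hash, type, parameter1, parameter2)
VALUES (UNHEX(MD5(CONCAT('A', '1'))), 'P1andP2', 1, 1);

-- HEX() converts back to a readable string when inspecting rows.
SELECT HEX(hash) AS hash_hex, type, parameter1, parameter2
FROM combinations_slim;
```

Whether this actually helps depends on the workload, so it is worth timing on a copy of the data before switching.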
Batch INSERT
Can you gather a bunch of these to INSERT at once? If you gather 1000 rows and string them into a single INSERT (actually 3 INSERTs, since 3 tables are involved), it will run about 10 times as fast.
Because of needing the ids, it gets more complicated. You would need to batch things into collection1 and collection2 first; then work on combinations.
Since the "collection*" tables are essentially "normalization", see my discussion of how to batch them very efficiently: http://mysql.rjweb.org/doc.php/staging_table#normalization It involves 2 statements, one to insert new rows, the other to grab all the ids for the batch.
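A minimal sketch of that batching idea follows. It is not the exact code from the linked page. It assumes a single writer, a hypothetical staging table that the client fills with one batch of raw rows, and the same hashing rule as the original procedure.

```sql
-- Hypothetical staging table holding one batch of raw input rows.
CREATE TABLE staging (
    parameter1 VARCHAR(128) NULL,
    parameter2 VARCHAR(128) NULL
) ENGINE=InnoDB;

-- The client loads a whole batch with one multi-row INSERT.
INSERT INTO staging (parameter1, parameter2) VALUES
    ('A', '1'), ('A', '2'), ('B', '1'), ('A', '1'), ('A', NULL), (NULL, '2');

-- Step 1: add only the values that are not yet normalized.
-- The anti-join keeps existing values from burning auto_increment ids;
-- IGNORE is only a safety net.
INSERT IGNORE INTO collection1 (parameter)
SELECT DISTINCT s.parameter1
FROM staging s
LEFT JOIN collection1 c ON c.parameter = s.parameter1
WHERE s.parameter1 IS NOT NULL AND c.id IS NULL;

INSERT IGNORE INTO collection2 (parameter)
SELECT DISTINCT s.parameter2
FROM staging s
LEFT JOIN collection2 c ON c.parameter = s.parameter2
WHERE s.parameter2 IS NOT NULL AND c.id IS NULL;

-- Step 2: look up all ids for the batch and insert the combinations.
-- Assumes parameter1 and parameter2 are never both NULL.
INSERT IGNORE INTO combinations (hash, type, parameter1, parameter2)
SELECT DISTINCT
    MD5(CONCAT(COALESCE(s.parameter1, ''), COALESCE(s.parameter2, ''))),
    CASE WHEN s.parameter1 IS NULL THEN 'P2Only'
         WHEN s.parameter2 IS NULL THEN 'P1Only'
         ELSE 'P1andP2' END,
    c1.id,
    c2.id
FROM staging s
LEFT JOIN collection1 c1 ON c1.parameter = s.parameter1
LEFT JOIN collection2 c2 ON c2.parameter = s.parameter2;

-- Empty the staging table before the next batch.
TRUNCATE TABLE staging;
```

Here the type is derived from which value is NULL rather than passed in, which is a design choice of this sketch and not something the original procedure does.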
COALESCE
Get rid of @parameter*_hash and @hash completely. Compute the hash inline in the final INSERT instead:
INSERT IGNORE INTO combinations (...) VALUES
    ( MD5(CONCAT(COALESCE(parameter1, ''), COALESCE(parameter2, ''))),
      ...)
Think of it this way... Each statement takes a non-trivial amount of time. (This shows up significantly in batching of inserts.) I'm getting rid of 4 statements at some expense due to adding complexity to one statement.
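Putting the last two suggestions together, the procedure might shrink to something like the sketch below. The name process_parameters_v2 and the p_ prefix on the arguments are assumptions of this example; the prefix is there only so the arguments cannot be confused with the parameter1 and parameter2 columns. DELIMITER is a mysql command-line client directive, so other clients may need a different way to submit the body.

```sql
DELIMITER //

CREATE PROCEDURE process_parameters_v2(
    IN p_parameter1 VARCHAR(128),
    IN p_parameter2 VARCHAR(128),
    IN p_combination_type CHAR(1))
BEGIN
    -- Local variables instead of user-defined @variables.
    DECLARE p1_id INT DEFAULT NULL;
    DECLARE p2_id INT DEFAULT NULL;

    IF p_parameter1 IS NOT NULL THEN
        -- On a duplicate, LAST_INSERT_ID(id) makes the existing id
        -- visible to the following LAST_INSERT_ID() call.
        INSERT INTO collection1 (parameter) VALUES (p_parameter1)
            ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
        SET p1_id = LAST_INSERT_ID();
    END IF;

    IF p_parameter2 IS NOT NULL THEN
        INSERT INTO collection2 (parameter) VALUES (p_parameter2)
            ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
        SET p2_id = LAST_INSERT_ID();
    END IF;

    -- Hash computed inline; no intermediate SET statements.
    INSERT IGNORE INTO combinations (hash, type, parameter1, parameter2)
    VALUES (
        MD5(CONCAT(COALESCE(p_parameter1, ''), COALESCE(p_parameter2, ''))),
        p_combination_type,
        p1_id,
        p2_id);
END //

DELIMITER ;
```

A call would look like CALL process_parameters_v2('A', NULL, ...), with the third argument being whatever type code the original procedure is already given.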
Settings
The most important might be innodb_flush_log_at_trx_commit = 2.
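For reference, a short sketch of checking and changing that setting at runtime. With the value 2, InnoDB writes the log at each commit but flushes it to disk only about once per second, so an operating system crash or power failure can lose roughly the last second of transactions. Changing a global variable requires administrative privileges, and the change does not survive a restart unless it is also put in the server configuration file.

```sql
-- Show the current value (the default is 1: flush at every commit).
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

-- Relax durability for the duration of the bulk load.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- Optionally restore the default once the load has finished.
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
```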
3 Streams
Write 3 procedures, each one with the code simplified to the particular type. Combining this with batching should further speed things up.
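As an illustration, one of the three specialized procedures could look like the sketch below, here for the case where only parameter1 is present. The procedure name and argument name are hypothetical, and the hash matches what the original procedure computes when parameter2 is NULL.

```sql
DELIMITER //

CREATE PROCEDURE process_p1_only(IN p_parameter1 VARCHAR(128))
BEGIN
    -- No IF branches: this stream never has a parameter2.
    INSERT INTO collection1 (parameter) VALUES (p_parameter1)
        ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

    -- LAST_INSERT_ID() still refers to the collection1 row here,
    -- because it is evaluated before this INSERT generates any id.
    INSERT IGNORE INTO combinations (hash, type, parameter1, parameter2)
    VALUES (MD5(p_parameter1), 'P1Only', LAST_INSERT_ID(), NULL);
END //

DELIMITER ;
```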
Potential issues
I think these two will get the same hash. Hence, only one row for these two:
("xyz", NULL) (NULL, "xyz")
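The collision is easy to confirm with a quick query. The second query shows one possible way around it, which is a suggestion of this sketch and not part of the original procedure: put a separator byte between the two values, here CHAR(0), assuming a NUL byte never occurs inside a parameter.

```sql
-- Both sides reduce to MD5('xyz'), so the comparison yields 1.
SELECT MD5(CONCAT(COALESCE('xyz', ''), COALESCE(NULL, '')))
     = MD5(CONCAT(COALESCE(NULL, ''), COALESCE('xyz', ''))) AS same_hash;

-- With a separator the inputs are 'xyz<NUL>' and '<NUL>xyz',
-- so the comparison yields 0.
SELECT MD5(CONCAT(COALESCE('xyz', ''), CHAR(0), COALESCE(NULL, '')))
     = MD5(CONCAT(COALESCE(NULL, ''), CHAR(0), COALESCE('xyz', ''))) AS same_hash;
```

Changing the hashing rule only makes sense on an empty table, or together with recomputing the hashes of the rows already stored.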
Be aware that INSERT IGNORE will burn ids if there is already a row with the given unique key. Because of this, keep an eye on running out of values with INT (only about 2 billion). Changing to INT UNSIGNED would raise the limit to about 4 billion, still in 4 bytes.
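A small sketch of how one might keep an eye on this and widen the range. The ALTER statements rebuild the tables, so they are best run before the bulk load, and the figure reported by information_schema may be a cached value on recent MySQL versions.

```sql
-- Next auto_increment value for each table in the current schema.
SELECT TABLE_NAME, AUTO_INCREMENT
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME IN ('collection1', 'collection2', 'combinations');

-- Double the usable id range without increasing the column size.
ALTER TABLE collection1 MODIFY id INT UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE collection2 MODIFY id INT UNSIGNED NOT NULL AUTO_INCREMENT;

-- The columns that store those ids should be widened to match.
ALTER TABLE combinations
    MODIFY parameter1 INT UNSIGNED NULL,
    MODIFY parameter2 INT UNSIGNED NULL;
```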