Optimizing a MySQL Query with a LEFT JOIN

We want to map the entries of calibration_data to the entries of calibration with the following query, but in my opinion the query takes far too long (>24h).

Is there any optimization possible? For testing we added more indexes than we need right now, but that didn't have any impact on the duration.

[Edit]

The hardware shouldn't be the biggest bottleneck:

  • 128 GB RAM
  • 1TB SSD RAID 5
  • 32 cores

EXPLAIN result

+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra                                          |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
|  1 | SIMPLE      | cal   | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    2009 |   100.00 | Using temporary; Using filesort                |
|  1 | SIMPLE      | m     | NULL       | ALL  | visit         | NULL | NULL    | NULL | 3082466 |   100.00 | Range checked for each record (index map: 0x1) |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+

Query which takes too long:

Insert into knn_data (SELECT cal.X           AS X, 
        cal.Y           AS Y, 
        cal.BeginTime   AS BeginTime, 
        cal.EndTime     AS EndTime, 
        avg(m.dbm_ant)  AS avg_dbm_ant, 
        m.ant_id        AS ant_id, 
        avg(m.location) avg_location, 
        count(*)        AS count, 
        m.visit 
 FROM   calibration cal 
        LEFT join calibration_data m
          ON m.visit BETWEEN cal.BeginTime AND cal.EndTime 
 GROUP  BY cal.X, 
           cal.Y, 
           cal.BeginTime, 
           cal.BeaconId, 
           m.ant_id,
           m.macHash,
           m.visit); 

Table knn_data:

    CREATE TABLE `knn_data` (
  `X` int(11) NOT NULL,
  `Y` int(11) NOT NULL,
  `BeginTime` datetime NOT NULL,
  `EndTIme` datetime NOT NULL,
  `avg_dbm_ant` float DEFAULT NULL,
  `ant_id` int(11) NOT NULL,
  `avg_location` float DEFAULT NULL,
  `count` int(11) DEFAULT NULL,
  `visit` datetime NOT NULL,
  PRIMARY KEY (`ant_id`,`visit`,`X`,`Y`,`BeginTime`,`EndTIme`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Table calibration

BeaconId, X, Y, BeginTime, EndTime
41791, 1698, 3944, 2016-11-12 22:44:00, 2016-11-12 22:49:00


CREATE TABLE `calibration` (
  `BeaconId` int(11) DEFAULT NULL,
  `X` int(11) DEFAULT NULL,
  `Y` int(11) DEFAULT NULL,
  `BeginTime` datetime DEFAULT NULL,
  `EndTime` datetime DEFAULT NULL,
  KEY `x,y` (`X`,`Y`),
  KEY `x` (`X`),
  KEY `y` (`Y`),
  KEY `BID` (`BeaconId`),
  KEY `beginTime` (`BeginTime`),
  KEY `x,y,beg,bid` (`X`,`Y`,`BeginTime`,`BeaconId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Table calibration_data

macHash, visit, dbm_ant, ant_id, mac, isRand, posX, posY, sources, ip, dayOfMonth, location, am, ar
'f5:dc:7d:73:2d:e9', '2016-11-12 22:44:00', '-87', '381', 'f5:dc:7d:73:2d:e9', NULL, NULL, NULL, NULL, NULL, '12', '18.077636300207715', 'inradius_41791', NULL


CREATE TABLE `calibration_data` (
  `macHash` varchar(100) COLLATE utf8_bin NOT NULL,
  `visit` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `dbm_ant` int(3) NOT NULL,
  `ant_id` int(11) NOT NULL,
  `mac` char(17) COLLATE utf8_bin DEFAULT NULL,
  `isRand` tinyint(4) DEFAULT NULL,
  `posX` double DEFAULT NULL,
  `posY` double DEFAULT NULL,
  `sources` int(2) DEFAULT NULL,
  `ip` int(10) unsigned DEFAULT NULL,
  `dayOfMonth` int(11) DEFAULT NULL,
  `location` varchar(80) COLLATE utf8_bin DEFAULT NULL,
  `am` varchar(300) COLLATE utf8_bin DEFAULT NULL,
  `ar` varchar(300) COLLATE utf8_bin DEFAULT NULL,
  KEY `visit` (`visit`),
  KEY `macHash` (`macHash`),
  KEY `ant, time` (`dbm_ant`,`visit`),
  KEY `beacon` (`am`),
  KEY `ant_id` (`ant_id`),
  KEY `ant,mH,visit` (`ant_id`,`macHash`,`visit`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Is this a one-time task? If so, the run time may not matter much. Once this data is loaded, will you incrementally update the "summary table" each day?

Shrink the datatypes; bulky data takes longer to process. Example: the 4-byte INT dayOfMonth could be a 1-byte TINYINT UNSIGNED.
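
As a rough sketch of the dayOfMonth example, assuming the column only ever holds values from 1 to 31. On a 3M-row table the ALTER may rebuild the table, so it is worth trying on a copy first:

```sql
-- Shrink dayOfMonth from a 4-byte INT to a 1-byte TINYINT UNSIGNED.
-- Assumption: no existing value falls outside 0..255.
ALTER TABLE calibration_data
  MODIFY `dayOfMonth` TINYINT UNSIGNED DEFAULT NULL;
```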

You are moving a TIMESTAMP into a DATETIME. This may or may not work as you expect.
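
One reason this can matter: TIMESTAMP values are stored in UTC and converted to the session time zone when read, while DATETIME values are stored exactly as written. A quick check, as a sketch, of which time zone that conversion uses:

```sql
-- Time zones applied when TIMESTAMP values are written or read.
SELECT @@global.time_zone, @@session.time_zone;

-- calibration_data.visit is a TIMESTAMP, so the value shown here
-- follows the session time zone; calibration.BeginTime (DATETIME) does not.
SET time_zone = '+00:00';
SELECT visit FROM calibration_data LIMIT 1;
```

If the two tables were loaded under different time zone settings, the BETWEEN comparison may match a shifted window.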

INT UNSIGNED is OK for IPv4, but you can't fit IPv6 in it.

COUNT(*) probably does not need a 4-byte INT; see the smaller variants.

Use UNSIGNED where appropriate.

A MAC address takes 19 bytes the way you have it; it could easily be converted to/from a 6-byte BINARY(6). See REPLACE(), UNHEX(), HEX(), etc.
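
A minimal sketch of that conversion using only built-in string functions. The column name `mac_bin` is hypothetical and not part of the original schema:

```sql
-- Text -> 6 bytes: strip the colons, then decode the 12 hex digits.
SELECT UNHEX(REPLACE('f5:dc:7d:73:2d:e9', ':', '')) AS mac_bin;

-- 6 bytes -> text: HEX() returns 12 uppercase hex digits without separators.
SELECT HEX(UNHEX('f5dc7d732de9')) AS mac_hex;

-- Hypothetical migration: add a compact column next to the existing one.
ALTER TABLE calibration_data ADD COLUMN mac_bin BINARY(6) NULL;

UPDATE calibration_data
   SET mac_bin = UNHEX(REPLACE(mac, ':', ''))
 WHERE mac IS NOT NULL;
```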

What is the setting of innodb_buffer_pool_size? It could be about 100G for the big RAM you have.
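
To check the current value, something like the following should work. The online resize is an assumption about your version: it is available from MySQL 5.7.5 on, while older servers need the value set in my.cnf followed by a restart.

```sql
-- Current buffer pool size, in bytes and in GiB.
SELECT @@innodb_buffer_pool_size AS bytes,
       @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS gib;

-- Resize online to roughly 100G (MySQL 5.7.5+ only).
-- Also set it in my.cnf, otherwise the change is lost on restart.
SET GLOBAL innodb_buffer_pool_size = 100 * 1024 * 1024 * 1024;
```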

Do the time ranges overlap? If not, take advantage of that. Also, don't include unnecessary columns in the PRIMARY KEY, such as EndTime.

Have the GROUP BY columns in the same order as the PRIMARY KEY of knn_data; this will avoid a lot of block splits during the INSERT.
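
A sketch of what that reordering could look like. The select list and join are unchanged from the question; only the GROUP BY order follows the PRIMARY KEY (ant_id, visit, X, Y, BeginTime, ...). The explicit ORDER BY is my addition, since MySQL 8.0 no longer sorts GROUP BY output implicitly:

```sql
INSERT INTO knn_data
SELECT cal.X, cal.Y, cal.BeginTime, cal.EndTime,
       AVG(m.dbm_ant), m.ant_id, AVG(m.location), COUNT(*), m.visit
FROM   calibration cal
       LEFT JOIN calibration_data m
         ON m.visit BETWEEN cal.BeginTime AND cal.EndTime
-- Leading columns follow the knn_data PRIMARY KEY order.
GROUP  BY m.ant_id, m.visit, cal.X, cal.Y, cal.BeginTime,
          cal.BeaconId, m.macHash
-- Rows then arrive in primary-key order, so InnoDB mostly appends.
ORDER  BY m.ant_id, m.visit, cal.X, cal.Y, cal.BeginTime;
```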

The big problem is that there is no useful index in calibration_data, so the JOIN has to do a full table scan again and again! An estimated 2K scans of 3M rows! Let me focus on that problem...

There is no good way to do WHERE x BETWEEN start AND end because MySQL does not know whether the datetime ranges overlap. There is no real cure for that in this context, so let me approach it differently...

Are start and end 'regular'? Like every hour? If so, we can do some sort of computation instead of the BETWEEN. Let me know if this is the case; I will continue my thoughts.

That's a nasty, classic problem with "range" queries: the optimizer doesn't use your indexes and ends up doing a full table scan. In your EXPLAIN plan you can see this in the type column: type=ALL.

Ideally you should have type=range and something in the key column.
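
For comparison, a single-window query with constant bounds is the kind of case where the optimizer can usually pick the `visit` index as a range. This is only a sketch using the sample row from the question; whether you actually get type=range depends on your data and statistics:

```sql
-- One calibration window with literal bounds instead of a join.
EXPLAIN
SELECT ant_id, AVG(dbm_ant), COUNT(*)
FROM   calibration_data
WHERE  visit BETWEEN '2016-11-12 22:44:00' AND '2016-11-12 22:49:00'
GROUP  BY ant_id;
```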

Some ideas:


I doubt that changing your join condition from

ON m.visit BETWEEN cal.BeginTime AND cal.EndTime 

to

ON m.visit >= cal.BeginTime AND m.visit <= cal.EndTime

will work, but still give it a try.


Do run ANALYZE TABLE on both tables. This will update the statistics on your tables and might help the optimizer make the right decision (i.e. use the indexes).
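
Something along these lines:

```sql
-- Refresh the index statistics for both tables.
ANALYZE TABLE calibration, calibration_data;

-- Inspect the cardinality estimates the optimizer now sees.
SHOW INDEX FROM calibration_data;
```

Then run EXPLAIN on the query again to see whether the plan changed.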


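Changing the query to this might also help force the optimizer to use indexes (note that the WHERE condition on m discards unmatched rows, so the LEFT JOIN then behaves like an inner join):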

Insert into knn_data (SELECT cal.X           AS X, 
        cal.Y           AS Y, 
        cal.BeginTime   AS BeginTime, 
        cal.EndTime     AS EndTime, 
        avg(m.dbm_ant)  AS avg_dbm_ant, 
        m.ant_id        AS ant_id, 
        avg(m.location) avg_location, 
        count(*)        AS count, 
        m.visit 
 FROM   calibration cal 
        LEFT join calibration_data m
          ON m.visit >= cal.BeginTime 
 WHERE m.visit <= cal.EndTime 
 GROUP  BY cal.X, 
           cal.Y, 
           cal.BeginTime, 
           cal.BeaconId, 
           m.ant_id,
           m.macHash,
           m.visit); 

That's all I can think of...
