
MySQL WHERE BETWEEN query optimization

Below is the format of the database of Autonomous System Numbers (downloaded and parsed from this site!).

range_start  range_end  number  cc  provider
-----------  ---------  ------  --  -------------------------------------
   16778240   16778495   56203  AU  AS56203 - BIGRED-NET-AU Big Red Group
   16793600   16809983   18144      AS18144

745465 total rows

A normal query looks like this:

select * from table where 3232235520 BETWEEN range_start AND range_end

This works properly, but I query a huge number of IPs to check their AS information, which ends up taking too many calls and too much time.

Profiler snapshot:

(image: Blackfire profiler snapshot)

I have two indexes:

  1. the id column
  2. a combined index on the range_start and range_end columns, as together they make a row unique.

Questions:

  1. Is there a way to query a huge number of IPs in a single query?
    • Multiple conditions, where (IP between range_start and range_end) OR (IP between range_start and range_end) OR ..., work, but I can't get the IP -> row mapping, that is, which rows were retrieved for which IP.
  2. Any suggestions to change the database structure to optimize the query speed and decrease the time?

Any help will be appreciated! Thanks!

It is possible to query more than one IP address at a time, and there are several approaches we could take. The following assumes range_start and range_end are defined as integer types.
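The integer values stored in range_start and range_end are the 32-bit form of a dotted-quad IPv4 address (the same value MySQL's INET_ATON() returns). A minimal sketch of doing that conversion on the application side, assuming IPv4 only; the helper names ip_to_int and int_to_ip are made up for illustration:

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))

def int_to_ip(n: int) -> str:
    """Convert a 32-bit integer back to a dotted-quad IPv4 string."""
    return str(ipaddress.IPv4Address(n))

print(ip_to_int("192.168.0.0"))   # 3232235520, the value used in the question's query
print(int_to_ip(16778240))        # 1.0.4.0, the first range_start in the sample rows
```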

For a reasonable number of IP addresses, we could use an inline view:

 SELECT i.ip, a.*
   FROM (           SELECT 3232235520 AS ip
          UNION ALL SELECT 3232235521
          UNION ALL SELECT 3232235522
          UNION ALL SELECT 3232235523
          UNION ALL SELECT 3232235524
          UNION ALL SELECT 3232235525
        ) i
   LEFT 
   JOIN ip_to_asn a
     ON a.range_start <= i.ip
    AND a.range_end   >= i.ip
  ORDER BY i.ip

This approach will work for a reasonable number of IP addresses. The inline view can be extended with more UNION ALL SELECT branches to add additional IP addresses. But that's not necessarily going to work for a "huge" number.
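Writing the UNION ALL branches by hand gets tedious, so the inline view text can be generated. The following is only a sketch: build_inline_view is a hypothetical helper, and the "?" placeholder is what Python's sqlite3 module expects (it is used here only so the example runs without a MySQL server; MySQL drivers for Python commonly use "%s" instead, so the placeholder is a parameter).

```python
import sqlite3

def build_inline_view(count: int, placeholder: str = "?") -> str:
    """Return 'SELECT ? AS ip UNION ALL SELECT ? ...' with `count` placeholders."""
    if count < 1:
        raise ValueError("need at least one IP address")
    first = f"SELECT {placeholder} AS ip"
    rest = [f"UNION ALL SELECT {placeholder}" for _ in range(count - 1)]
    return " ".join([first] + rest)

ips = [3232235520, 3232235521, 3232235522]
view_sql = build_inline_view(len(ips))

# Run against in-memory SQLite just to show the generated SQL is well formed.
conn = sqlite3.connect(":memory:")
rows = conn.execute(f"SELECT i.ip FROM ({view_sql}) i ORDER BY i.ip", ips).fetchall()
print(view_sql)
print(rows)
```

Binding the addresses as parameters, rather than pasting them into the SQL text, keeps the statement safe even if the IP list comes from untrusted input.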

When we get to "huge", we're going to run into limitations in MySQL: the maximum size of a SQL statement is limited by max_allowed_packet, and there may be a limit on the number of SELECTs that can appear in a statement.

The inline view could be replaced with a temporary table, built first.

 DROP TEMPORARY TABLE IF EXISTS _ip_list_;
 CREATE TEMPORARY TABLE _ip_list_ (ip BIGINT NOT NULL PRIMARY KEY) ENGINE=InnoDB;
 INSERT INTO _ip_list_ (ip) VALUES (3232235520),(3232235521),(3232235522),...;
 ...
 INSERT INTO _ip_list_ (ip) VALUES (3232237989),(3232237990);

Then reference the temporary table in place of the inline view:

 SELECT i.ip, a.*
   FROM _ip_list_ i
   LEFT
   JOIN ip_to_asn a
     ON a.range_start <= i.ip
    AND a.range_end   >= i.ip
  ORDER BY i.ip ;

And then drop the temporary table:

 DROP TEMPORARY TABLE IF EXISTS _ip_list_ ;
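To see the whole temporary-table round trip in one place, here is a small sketch that runs the same pattern against an in-memory SQLite database. SQLite stands in for MySQL purely so the example is self-contained (its DDL differs slightly, e.g. CREATE TEMP TABLE and no ENGINE clause); the two ip_to_asn rows are the sample rows from the question, and the lookup addresses are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ip_to_asn ("
    " range_start INTEGER NOT NULL, range_end INTEGER NOT NULL,"
    " number INTEGER, cc TEXT, provider TEXT,"
    " PRIMARY KEY (range_start, range_end))"
)
# The two sample rows shown in the question (cc is blank for the second one).
conn.executemany(
    "INSERT INTO ip_to_asn VALUES (?, ?, ?, ?, ?)",
    [
        (16778240, 16778495, 56203, "AU", "AS56203 - BIGRED-NET-AU Big Red Group"),
        (16793600, 16809983, 18144, None, "AS18144"),
    ],
)

# Load the IP addresses to look up into a temporary table.
conn.execute("CREATE TEMP TABLE _ip_list_ (ip INTEGER NOT NULL PRIMARY KEY)")
lookups = [16778300, 16800000, 3232235520]
conn.executemany("INSERT INTO _ip_list_ (ip) VALUES (?)", [(ip,) for ip in lookups])

# One statement returns the IP -> row mapping; unmatched IPs come back with NULLs.
rows = conn.execute(
    "SELECT i.ip, a.number, a.provider"
    "  FROM _ip_list_ i"
    "  LEFT JOIN ip_to_asn a"
    "    ON a.range_start <= i.ip AND a.range_end >= i.ip"
    " ORDER BY i.ip"
).fetchall()
for row in rows:
    print(row)

conn.execute("DROP TABLE _ip_list_")
```

Because of the LEFT JOIN, every looked-up IP appears in the result, which gives the IP -> row mapping the question asks for.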

Some other notes:

Churning database connections is going to degrade performance. There's a significant amount of overhead in establishing and tearing down a connection. That overhead gets noticeable if the application is repeatedly connecting and disconnecting, especially if it's doing that for every SQL statement being issued.

And running an individual SQL statement also has overhead: the statement has to be sent to the server, parsed for syntax, and evaluated for semantics; an execution plan has to be chosen and executed; and a resultset has to be prepared and returned to the client. This is why it's more efficient to process set-wise rather than row-wise. Processing RBAR (row by agonizing row) can be very slow compared to sending a statement to the database and letting it process a set in one fell swoop.

But there's a tradeoff there. With ginormous sets, things can start to get slow again.

Even if you process just two IP addresses in each statement, that halves the number of statements that need to be executed. If you do 20 IP addresses in each statement, that cuts the number of statements to 5% of the number that would be required processing one row at a time.
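The batching arithmetic is easy to apply on the application side. A minimal sketch, where chunked is a hypothetical helper and the batch size of 20 is just the figure used above; a good size for a real workload would have to be found by measuring:

```python
from typing import Iterator, List, Sequence

def chunked(items: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

ips = list(range(3232235520, 3232235520 + 100))   # 100 example addresses
batches = list(chunked(ips, 20))
print(len(ips), "lookups ->", len(batches), "statements")
```

Each batch would then be sent as one statement (or one temporary-table load) instead of one statement per address.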


And the composite index already defined on (range_start,range_end) is appropriate for this query.


FOLLOWUP

As Rick James points out in a comment, the index I earlier said was "appropriate" is less than ideal.

We could write the query a little differently, in a way that might make more effective use of that index.

If (range_start,range_end) is a UNIQUE (or PRIMARY) KEY, then this will return one row per IP address, even when there are "overlapping" ranges. (The previous query would return all of the rows whose range_start to range_end interval contains the IP address.)

 SELECT t.ip, a.*
   FROM ( SELECT s.ip
               , s.range_start
               , MIN(e.range_end) AS range_end
            FROM ( SELECT i.ip
                        , MAX(r.range_start) AS range_start
                     FROM _ip_list_ i
                     LEFT
                     JOIN ip_to_asn r
                       ON r.range_start <= i.ip
                    GROUP BY i.ip
                 ) s
            LEFT
            JOIN ip_to_asn e
              ON e.range_start = s.range_start
             AND e.range_end  >= s.ip
           GROUP BY s.ip, s.range_start
        ) t
   LEFT
   JOIN ip_to_asn a
     ON a.range_start = t.range_start
    AND a.range_end   = t.range_end
  ORDER BY t.ip ;

With this query, for the innermost inline view query s, the optimizer might be able to make effective use of an index with a leading column of range_start to quickly identify the "highest" value of range_start (that is less than or equal to the IP address). (But with that outer join, and with the GROUP BY on i.ip, I'd really need to look at the EXPLAIN output; it's only conjecture what the optimizer might do; what is important is what the optimizer actually does.)

Then, for the join to e (in inline view t), MySQL might be able to make more effective use of the composite index on (range_start,range_end), because of the equality predicate on the first column, and the inequality condition combined with the MIN aggregate on the second column.

For the outermost query, MySQL will surely be able to make effective use of the composite index, due to the equality predicates on both columns.

A query of this form might show improved performance, or performance might go to hell in a handbasket. The output of EXPLAIN should give a good indication of what's going on. We'd like to see "Using index for group-by" in the Extra column, and we only want to see a "Using filesort" for the ORDER BY on the outermost query. (If we remove the ORDER BY clause, we want to not see "Using filesort" in the Extra column.)


Another approach is to make use of correlated subqueries in the SELECT list. The execution of correlated subqueries can get expensive when the resultset contains a large number of rows. But this approach can give satisfactory performance for some use cases.

This query depends on there being no overlapping ranges in the ip_to_asn table; it will not produce the expected results when overlapping ranges exist.

 SELECT t.ip, a.*
   FROM ( SELECT i.ip
               , ( SELECT MAX(s.range_start)
                     FROM ip_to_asn s
                    WHERE s.range_start <= i.ip
                 ) AS range_start
               , ( SELECT MIN(e.range_end)
                     FROM ip_to_asn e
                    WHERE e.range_end >= i.ip
                 ) AS range_end
            FROM _ip_list_ i
        ) r
   LEFT 
   JOIN ip_to_asn a
     ON a.range_start = r.range_start
    AND a.range_end   = r.range_end

As a demonstration of why overlapping ranges will be a problem for this query, consider a totally goofy, made-up example:

range_start  range_end 
-----------  ---------
       .101       .160
       .128       .244

Given an IP address of .140, the MAX(range_start) subquery will find .128, the MIN(range_end) subquery will find .160, and then the outer query will attempt to find a matching row with range_start=.128 AND range_end=.160. And that row just doesn't exist.
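The failure is easy to reproduce. The sketch below loads the two made-up ranges into an in-memory SQLite database (standing in for MySQL so the example runs on its own; only the last octets are stored, as plain integers) and runs the two subqueries and the final equality match for the address .140:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ip_to_asn (range_start INTEGER, range_end INTEGER)")
conn.executemany("INSERT INTO ip_to_asn VALUES (?, ?)", [(101, 160), (128, 244)])

ip = 140
# The two correlated subqueries from the query above, run for a single IP.
max_start = conn.execute(
    "SELECT MAX(range_start) FROM ip_to_asn WHERE range_start <= ?", (ip,)
).fetchone()[0]
min_end = conn.execute(
    "SELECT MIN(range_end) FROM ip_to_asn WHERE range_end >= ?", (ip,)
).fetchone()[0]

# The outer join then looks for a row with exactly this (start, end) pair.
match = conn.execute(
    "SELECT * FROM ip_to_asn WHERE range_start = ? AND range_end = ?",
    (max_start, min_end),
).fetchall()

# A plain range predicate finds both overlapping rows.
covering = conn.execute(
    "SELECT * FROM ip_to_asn WHERE ? BETWEEN range_start AND range_end"
    " ORDER BY range_start",
    (ip,),
).fetchall()

print(max_start, min_end)   # 128 160
print(match)                # []
print(covering)             # [(101, 160), (128, 244)]
```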

  1. You can compare IP ranges using MySQL. This question might contain the answer you're looking for: MySQL check if an IP-address is in range?
 SELECT * FROM TABLE_NAME WHERE (INET_ATON("193.235.19.255") BETWEEN INET_ATON(ipStart) AND INET_ATON(ipEnd)); 
  2. You will likely want to index your database. This reduces the time it takes to search your database, similar to the index you will find in the back of a textbook, but for databases:

     ALTER TABLE `table` ADD INDEX `name` (`column_id`) 

EDIT: Apparently wrapping the columns in INET_ATON() prevents MySQL from using an index on them, so you would have to pick one of these!

This is a duplicate of the question here; however, I'm not voting to close it, as the accepted answer to that question is not very helpful; the answer by Quassnoi is much better (but it only links to the solution).

A linear index is not going to help resolve a database of ranges. The solution is to use geospatial indexing (available in MySQL and other DBMSs). An added complication is that MySQL geospatial indexing only works in 2 dimensions (while you have a 1-D dataset), so you need to map the data to 2 dimensions.

Hence:

CREATE TABLE IF NOT EXISTS `inetnum` (
  `from_ip` int(11) unsigned NOT NULL,
  `to_ip` int(11) unsigned NOT NULL,
  `netname` varchar(40) default NULL,
  `ip_txt` varchar(60) default NULL,
  `descr` varchar(60) default NULL,
  `country` varchar(2) default NULL,
  `rir` enum('APNIC','AFRINIC','ARIN','RIPE','LACNIC') NOT NULL default 'RIPE',
  `netrange` linestring NOT NULL,
  PRIMARY KEY  (`from_ip`,`to_ip`),
  SPATIAL KEY `rangelookup` (`netrange`)
) ENGINE=MyISAM DEFAULT CHARSET=ascii;

Which might be populated with:

INSERT INTO inetnum
(from_ip, to_ip
 , netname, ip_txt, descr, country
 , netrange)
VALUES
(INET_ATON('127.0.0.0'), INET_ATON('127.0.0.2') 
 , 'localhost','127.0.0.0-127.0.0.2', 'Local Machine', '.',
GEOMFROMWKB(POLYGON(LINESTRING(
   POINT(INET_ATON('127.0.0.0'), -1), 
   POINT(INET_ATON('127.0.0.2'), -1),
   POINT(INET_ATON('127.0.0.2'), 1), 
   POINT(INET_ATON('127.0.0.0'), 1),
   POINT(INET_ATON('127.0.0.0'), -1))))
);
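The rectangle in the INSERT above stretches the 1-D range along the x axis and gives it an arbitrary height from -1 to 1 on the y axis, so that a point at (ip, 0) falls inside it. If the rows are loaded from application code, the polygon text could be generated there. A minimal sketch, assuming IPv4 input; range_to_polygon_wkt is a hypothetical helper, and the resulting WKT string would still have to be passed through a function such as GEOMFROMTEXT() on the MySQL side:

```python
import ipaddress

def range_to_polygon_wkt(first_ip: str, last_ip: str) -> str:
    """Return WKT for a rectangle spanning [first_ip, last_ip] on x and [-1, 1] on y."""
    x1 = int(ipaddress.IPv4Address(first_ip))
    x2 = int(ipaddress.IPv4Address(last_ip))
    if x1 > x2:
        raise ValueError("first_ip must not be greater than last_ip")
    # Same corner order as the INSERT above; the ring is closed by repeating the first point.
    corners = [(x1, -1), (x2, -1), (x2, 1), (x1, 1), (x1, -1)]
    ring = ", ".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON(({ring}))"

print(range_to_polygon_wkt("127.0.0.0", "127.0.0.2"))
```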

Then you might want to create a function to wrap the rather verbose SQL:

DELIMITER //
DROP FUNCTION IF EXISTS `netname2`//
CREATE DEFINER=`root`@`localhost` FUNCTION `netname2`(p_ip VARCHAR(20) CHARACTER SET ascii) RETURNS varchar(80) CHARSET ascii
   READS SQL DATA
   DETERMINISTIC
BEGIN
  DECLARE l_netname varchar(80);

  -- Find ranges whose rectangle contains the point (ip, 0);
  -- when several match, keep the narrowest one.
  SELECT CONCAT(country, '/',netname)
    INTO l_netname
  FROM inetnum
  WHERE MBRCONTAINS(netrange, GEOMFROMTEXT(CONCAT('POINT(', INET_ATON(p_ip), ' 0)')))
  ORDER BY (to_ip-from_ip)
  LIMIT 0,1;

  RETURN l_netname;
END//
DELIMITER ;

And therefore:

SELECT netname2('127.0.0.1');

./localhost

Which uses the index:

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   SIMPLE  inetnum     range   rangelookup     rangelookup     34  NULL    1   Using where; Using filesort

(and takes around 10 msec to find a record from the combined APNIC, AFRINIC, ARIN, RIPE and LACNIC datasets on the very low spec VM I'm using here)
