
Utilize indexes for MySQL subqueries or joins over ranges

I have a list of requests and their respective IP addresses (~2 million rows). I'm trying to do a simple JOIN against a complete, non-overlapping list of IP ranges (~12 million rows). I have indexed the IP ranges with ip_from (B-tree, ascending) and ip_to (B-tree, ascending).

I have tried several techniques for combining the data from these two tables; all have proven very inefficient so far.

I have tried a regular JOIN, a JOIN with a maximum difference of IP range, and sub-queries. According to EXPLAIN, every variant lists possible_keys but does not actually use them. I have tried FORCE INDEX without any luck.

Running the lookups separately shows that a single IP lookup should take about 2 ms with SELECT * FROM ip_ranges WHERE INET_ATON(<some ip>) <= ip_to LIMIT 1; and the request table takes about 16 ms for every 200 lookups.

Here is my current query. It takes about 30 seconds to return any results, simply because the indexes are not fully utilized:

SELECT 
rs.fingerprint,
rs.ip,
ipr.country_code,
ipr.country_name,
ipr.region,
ipr.city,
ipr.isp_name,
ipr.domain_name,
ipr.usage_type
FROM requests AS rs
JOIN ip_ranges AS ipr ON INET_ATON(rs.ip) BETWEEN ipr.ip_from AND ipr.ip_to
LIMIT 10;

So, is there some way to optimize this for MySQL? Or should I instead call the database individually for each request from Python (that is, join them manually outside of SQL)?

Update:

I have now tried converting each IP address into its numeric form, stored in a DECIMAL(39) column called ip_numeric, as suggested in the answers below. A precision of 39 is used so that IPv6 addresses are also supported. The database still won't use the index keys for the range lookup.

Because a join cannot be optimized on a function result (here, INET_ATON() of the IP address), it will not take advantage of the index.

To correct this, I would do the following: apply INET_ATON() to the address before inserting it into the requests table. This way, the IP address is already stored in its normalized numeric form. Do the same for the ip_ranges columns (from and to) so that they are stored in the same format.

Then the join on the IP does not have to evaluate/convert the value for every row before the BETWEEN test is applied.
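Since the question mentions Python on the client side, here is a minimal sketch of doing that conversion before the insert. The function names and the row layout (fingerprint, ip, ip_numeric) are assumptions for illustration, not part of the original schema. Note that Python's ipaddress module rejects abbreviated forms such as '127.1' that INET_ATON() accepts.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Return the integer form of an IPv4 or IPv6 address string.

    For a full dotted-quad IPv4 address this matches MySQL's INET_ATON().
    """
    return int(ipaddress.ip_address(ip))


def prepare_request_row(fingerprint: str, ip: str) -> tuple:
    """Build a row with the numeric IP pre-computed (hypothetical layout)."""
    return (fingerprint, ip, ip_to_int(ip))


row = prepare_request_row("abc123", "10.0.5.9")
```

The resulting tuple can be passed to a parameterized INSERT, so the stored column is compared directly in the join with no per-row function call.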

FEEDBACK

Indexes are on columns, not functions. I have no specific documentation for this, just experience. An index is built on the value of a COLUMN. If you join on a function result, the function has to be run against the original column for every record. By storing the pre-computed final value of the IP, you now have the properly formatted address, and the index can work directly on it with no further conversion. Likewise, when populating the joined-to table with from/to addresses, you are storing your data up front in the final format used for comparison.

It is much like date indexes. Just index the date field, not the month/year. Then, when you want something like last month's rows, you would not write month( someDateColumn ) = 10 and year( someDateColumn ) = 2019. You would write someDateColumn >= '2019-10-01' and someDateColumn < '2019-11-01'. The range test on the indexed date column can use the index, so it runs faster than the function comparison.
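As a small illustration of the date example, the half-open range can be computed on the client and passed as parameters. This is only a sketch: month_bounds and the table name someTable are hypothetical, and the %s placeholders assume a MySQL driver that uses that parameter style.

```python
from datetime import date


def month_bounds(year: int, month: int):
    """Return (first day of the month, first day of the next month)."""
    start = date(year, month, 1)
    # December rolls over into January of the following year.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


start, end = month_bounds(2019, 10)

# someTable is a hypothetical table name; someDateColumn is from the text above.
sql = ("SELECT * FROM someTable "
       "WHERE someDateColumn >= %s AND someDateColumn < %s")
params = (start.isoformat(), end.isoformat())
```

Because the column itself appears bare on the left of each comparison, the optimizer is free to use an index range scan on it.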

You can add a virtual column to the table and index that:

ALTER TABLE requests ADD ip_numeric bigint GENERATED ALWAYS AS (INET_ATON(ip)) virtual;

CREATE INDEX ip_numeric_ind ON requests (ip_numeric);

Then use that in your query:

SELECT 
rs.fingerprint,
rs.ip,
ipr.country_code,
ipr.country_name,
ipr.region,
ipr.city,
ipr.isp_name,
ipr.domain_name,
ipr.usage_type
FROM requests AS rs
JOIN ip_ranges AS ipr ON rs.ip_numeric BETWEEN ipr.ip_from AND ipr.ip_to
LIMIT 10;

If you can guarantee that the from..to pairs do not overlap, there is a way to significantly speed up lookups in such tables. It involves building a table with only ip_from, knowing that each row's ip_to is one less than the ip_from of the next row.

Discussion, including reference code for IPv4 and IPv6: http://mysql.rjweb.org/doc.php/ipranges
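To illustrate why storing only ip_from is enough, here is a minimal in-memory sketch in Python. It is not the reference code from the link above; the sample ranges and labels are invented, and the list is assumed to start at 0 so that every address falls into some range.

```python
from bisect import bisect_right

# Hypothetical data: one start value per range, sorted ascending.
# Each range implicitly ends one below the next range's start.
starts = [0, 167772160, 184549376]            # 0.0.0.0, 10.0.0.0, 11.0.0.0
labels = ["unassigned", "private-10", "other"]


def lookup(ip_num: int) -> str:
    """Return the label of the range whose start is the largest one <= ip_num."""
    i = bisect_right(starts, ip_num) - 1
    return labels[i]
```

This is roughly what a query of the form "largest ip_from that is <= the address, LIMIT 1" does against an index on ip_from: a single probe followed by reading one row, rather than a scan over a two-column range.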

This may be a rare case where CURSORs work faster than trying to do it in a single query.

That is, doing 10 separate lookups with the above technique will be very fast. If you need to do 2M lookups, we need to start over. A thought: sort the 2M and the 12M, then match them up as you go. (A la the "merge" part of traditional "sort-merge" algorithms.) (No, I have not thought through the details.)
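The sort-and-merge idea can be sketched as follows. This is one possible reading of it, worked out in Python on in-memory lists; the function name and the (ip_from, ip_to, payload) tuple layout are assumptions, and the ranges are assumed not to overlap.

```python
def merge_lookup(request_ips, ranges):
    """Match numeric IPs to non-overlapping (ip_from, ip_to, payload) ranges.

    Both inputs are sorted, then walked once in step.
    Returns a list of (ip, payload or None) in ascending ip order.
    """
    sorted_ranges = sorted(ranges, key=lambda r: r[0])
    out = []
    i = 0
    for ip in sorted(request_ips):
        # Skip ranges that end before this ip; they cannot match any later ip either.
        while i < len(sorted_ranges) and sorted_ranges[i][1] < ip:
            i += 1
        if i < len(sorted_ranges) and sorted_ranges[i][0] <= ip:
            out.append((ip, sorted_ranges[i][2]))
        else:
            out.append((ip, None))  # ip falls in a gap or past the last range
    return out


sample = merge_lookup(
    [45, 5, 20, 35, 19],
    [(10, 19, "a"), (20, 29, "b"), (40, 49, "c")],
)
```

After the two sorts, the matching pass touches each request and each range at most once, which is what makes it attractive for 2M lookups against 12M ranges.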

It can be quite hard to get indexes to be used for ranges. I recommend the following approach:

  • Find the first range that ends on or after the given IP.
  • Join back to the range table to get the start point.
  • Compare!

As SQL:

select r.*, ir.*
from (select r.*,
             (select ir.ip_to
              from ip_ranges ir
              where ir.ip_to >= inet_aton(r.ip)
              order by ir.ip_to
              limit 1
             ) as range_to
      from requests r
     ) r join
     ip_ranges ir
     on ir.ip_to = r.range_to
where inet_aton(r.ip) >= ir.ip_from;

This wants an index on ip_ranges(ip_to), both for the correlated subquery and for the final join.
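To check the logic of this approach on a toy data set, the sketch below runs the same query shape from Python against an in-memory SQLite database. SQLite is only a stand-in here and says nothing about MySQL's plan or timing; inet_aton is registered as a user function because SQLite has none, and the sample rows are invented. The final comparison is written as inet_aton(r.ip) >= ir.ip_from, so that a request falling in a gap before the start of the candidate range is rejected.

```python
import ipaddress
import sqlite3


def inet_aton(ip):
    """Stand-in for MySQL's INET_ATON() for full dotted-quad IPv4 strings."""
    return int(ipaddress.IPv4Address(ip))


conn = sqlite3.connect(":memory:")
conn.create_function("inet_aton", 1, inet_aton)
conn.executescript("""
    CREATE TABLE ip_ranges (ip_from INTEGER, ip_to INTEGER, country_code TEXT);
    CREATE INDEX ip_ranges_to ON ip_ranges (ip_to);
    CREATE TABLE requests (fingerprint TEXT, ip TEXT);
    -- invented sample data: 10.0.0.0/8 and 192.168.0.0/16
    INSERT INTO ip_ranges VALUES (167772160, 184549375, 'AA'),
                                 (3232235520, 3232301055, 'BB');
    INSERT INTO requests VALUES ('f1', '10.0.5.9'),
                                ('f2', '192.168.1.1'),
                                ('f3', '172.16.0.1');  -- falls in a gap
""")

rows = conn.execute("""
    SELECT r.fingerprint, r.ip, ir.country_code
    FROM (SELECT r.*,
                 (SELECT ir.ip_to
                  FROM ip_ranges ir
                  WHERE ir.ip_to >= inet_aton(r.ip)
                  ORDER BY ir.ip_to
                  LIMIT 1) AS range_to
          FROM requests r) r
    JOIN ip_ranges ir ON ir.ip_to = r.range_to
    WHERE inet_aton(r.ip) >= ir.ip_from
    ORDER BY r.fingerprint
""").fetchall()
```

The request 'f3' finds the 192.168.0.0/16 range as its first candidate by ip_to, but is dropped by the final comparison because it lies before that range's ip_from.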
