Can my query with multiple LIKE statements and REGEXP be more efficient?

I'm constructing a dynamic query to select dropped domain names from my database. At the moment there are a dozen rows, but I will soon receive data with up to 500,000 rows.

The schema is just one table containing 4 columns:

CREATE TABLE `DroppedDomains` (
  `domainID` int(11) NOT NULL AUTO_INCREMENT,
  `DomainName` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
  `DropDate` date DEFAULT NULL,
  `TLD` varchar(5) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`domainID`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

I did not create the schema; this is the live database schema. Here's sample data:

[image: sample data]

I've constructed probably the most complex type of query below. The criteria are as follows:

SELECT any number of domains which

  1. Start with the word 'starts'
  2. End with the word 'ends'
  3. Contain the word 'containsThis' anywhere in the domain name
  4. Contain the word 'ContainsThisToo' anywhere in the domain name
  5. Include at least one digit
  6. The domain name must be at least 49 characters. Multibyte characters need to count as one character (I used CHAR_LENGTH).
  7. The domain name must be under 65 characters.
  8. The TLD must be 'org'
  9. The DropDate needs to be later than 2009-11-01

Here's my query so far:

SELECT *
FROM DroppedDomains
WHERE 1=1
  AND DomainName LIKE 'starts%ends'
  AND DomainName LIKE '%containsThis%'
  AND DomainName LIKE '%containsThisToo%'
  AND DomainName LIKE '%-%'
  AND DomainName REGEXP '[0-9]'
  AND CHAR_LENGTH(DomainName) > 49
  AND CHAR_LENGTH(DomainName) < 65
  AND TLD = 'org'
  AND DropDate > '2009-11-01'

Here are my questions:

  1. Considering I'll have half a million rows, would it significantly benefit performance if I moved the TLD values into their own table and made the TLD column a foreign key to it? There will only be 5 TLDs (com, net, org, info, biz). I realize there are more TLDs in the real world, but this application will only have 5. The user cannot specify their own TLD.

  2. I know that REGEXP and 500,000 rows is probably a recipe for disaster. Is there any way I can avoid the REGEXP?

  3. Are there any other optimizations I can make to the query, such as merging the LIKEs or using other functions such as INSTR? And should I implement any specific sort of caching mechanism?

When you have a LIKE pattern that starts with a constant prefix and you have an index on that field, the index can be used to find the rows starting with that prefix very quickly. Luckily you have exactly this situation here:

AND DomainName LIKE 'starts%ends'

If only a few of the values start with 'starts', then these rows will be found very quickly and the other expressions will only be tested for these rows. You can check that the index is used by running EXPLAIN SELECT ... .
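As a rough sketch of that check: the index name idx_domainname below is an assumption (the schema in the question defines only the primary key), and the query is a shortened version of the one in the question.

```sql
-- Hypothetical index name; the question's schema has no index on DomainName.
CREATE INDEX idx_domainname ON DroppedDomains (DomainName);

-- Ask MySQL how it plans to execute the query.
EXPLAIN
SELECT *
FROM DroppedDomains
WHERE DomainName LIKE 'starts%ends'     -- constant prefix: can use the index
  AND DomainName LIKE '%containsThis%'  -- leading wildcard: cannot use the index
  AND DomainName REGEXP '[0-9]'
  AND TLD = 'org'
  AND DropDate > '2009-11-01';
```

If the index is chosen, the key column of the EXPLAIN output should name idx_domainname, typically with type range. On a table with only a dozen rows the optimizer may still prefer a full scan, so the plan is worth rechecking once the real data is loaded.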

You should plan which indexes to create according to the queries you plan to run.

  • If you'll have queries that filter only by DropDate, then an index on DropDate will be useful.
  • If you'll have queries that group by TLD, then an index on TLD will be useful.
  • If you'll have queries that search only by the length of DomainName, then you may consider adding a field DomainNameLength that stores exactly that (with an index on it), so the length is not calculated every time you run the query.
  • If you'll have queries that search (filter) by two fields (e.g. TLD and DropDate), then you probably need a 2-column index on these fields.
  • etc.
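The cases in the list above could look roughly like this. All index names, and the DomainNameLength column definition, are assumptions made for illustration; only the table and the other column names come from the question.

```sql
-- Queries that filter only by DropDate.
CREATE INDEX idx_dropdate ON DroppedDomains (DropDate);

-- Queries that filter by TLD and DropDate together.
-- As a leftmost prefix, this index can also serve queries on TLD alone.
CREATE INDEX idx_tld_dropdate ON DroppedDomains (TLD, DropDate);

-- Precomputed length, so CHAR_LENGTH() is not evaluated for every row.
-- TINYINT UNSIGNED (0-255) is enough for a varchar(100) column.
ALTER TABLE DroppedDomains
  ADD COLUMN DomainNameLength TINYINT UNSIGNED DEFAULT NULL;

-- Backfill the new column for existing rows.
UPDATE DroppedDomains
SET DomainNameLength = CHAR_LENGTH(DomainName);

CREATE INDEX idx_namelength ON DroppedDomains (DomainNameLength);

-- The length criteria from the question, rewritten against the new column.
SELECT *
FROM DroppedDomains
WHERE DomainNameLength > 49
  AND DomainNameLength < 65;
```

A plain column like this has to be kept in sync whenever a row is inserted or its DomainName changes, either in the application or with a trigger. Each extra index also adds some cost to writes, so it is probably best to create only the ones your actual queries need.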

If the only query you'll use is the complex one you mention, then Mark's advice (an index on DomainName) is best.

Regarding question 1 about the TLD field:

If you are really going to have only a small number (like 5) of options for this and you are not planning to use all available TLDs, you could use the ENUM type.

CREATE TABLE `DroppedDomains` (
   ....
   `TLD` ENUM('com', 'net', 'org', 'info', 'biz')
)
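Since the table in the question already exists, the same change could be applied in place instead of recreating it. This is only a sketch that assumes every existing TLD value is one of the five listed.

```sql
-- Convert the existing varchar(5) column to an ENUM in place.
-- Assumes all current values are among the five listed; a value outside
-- the list would be rejected in strict SQL mode, or stored as an empty
-- string otherwise.
ALTER TABLE DroppedDomains
  MODIFY TLD ENUM('com', 'net', 'org', 'info', 'biz') DEFAULT NULL;

-- Queries keep using the string form.
SELECT *
FROM DroppedDomains
WHERE TLD = 'org';
```

Running SELECT DISTINCT TLD FROM DroppedDomains beforehand is a cheap way to confirm that assumption on the live data.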
