Improve the performance of a “reverse” LIKE query

I have a query that looks similar to this:

SELECT  CustomerId
FROM    Customer cust
WHERE   'SoseJost75G' LIKE cust.ClientCustomerId + '%' -- ClientCustomerId is SoseJost 

The gist is this: I get a value from the customer that is my ClientCustomerId, but with an unknown number of extra characters appended to the end.

So in my example, the customer gives me SoseJost75G, but my database only has SoseJost (without the 75G on the end).

My query works, but it takes over a minute to run. That is because it can't use the index that is on ClientCustomerId.

Does anyone know a way to improve the performance of this kind of query?

You might try something like this:

DECLARE @var VARCHAR(100)='SoseJost75G';

WITH pre_selected AS
(SELECT * FROM Customer WHERE ClientCustomerId  LIKE LEFT(@var,6) + '%')
SELECT * 
FROM pre_selected WHERE @var LIKE ClientCustomerId +'%';

A LIKE search with a fixed prefix can use an existing index on ClientCustomerId.

With a CTE you never know exactly which order of execution will take place, but in a quick test the optimizer chose to reduce the set to a tiny remainder first and to perform the heavy search as a second step.
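One way to check what the optimizer actually did is to compare logical reads for the query. The following is only a sketch: it assumes the Customer table and the index on ClientCustomerId from the question, and a six-character minimum prefix as in the query above. A low number of logical reads on Customer suggests that the prefix filter was applied through the index first.

```sql
-- Report I/O per statement in the Messages tab (SQL Server).
SET STATISTICS IO ON;

DECLARE @var VARCHAR(100) = 'SoseJost75G';

WITH pre_selected AS
(
    -- Fixed-prefix LIKE: this part can use an index seek.
    SELECT CustomerId, ClientCustomerId
    FROM Customer
    WHERE ClientCustomerId LIKE LEFT(@var, 6) + '%'
)
SELECT CustomerId
FROM pre_selected
WHERE @var LIKE ClientCustomerId + '%';  -- reverse LIKE on the reduced set

SET STATISTICS IO OFF;
```

Looking at the actual execution plan gives the same information in more detail.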

If the order of execution is not what you expect, you might insert the result of the first CTE query into a table variable (only the ID column) and then continue with this tiny table.

Something like this:

DECLARE @var VARCHAR(100)='SoseJost75G';

DECLARE @CustIDs TABLE(ClientCustomerID VARCHAR(100));
INSERT INTO @CustIDs(ClientCustomerID)
SELECT ClientCustomerID FROM Customer WHERE ClientCustomerId LIKE LEFT(@var,6) + '%';

-- Run the reverse LIKE against the small pre-filtered set (the IDs could also feed an IN clause)
SELECT ClientCustomerId 
FROM @CustIDs WHERE @var LIKE ClientCustomerID +'%'

So, the query to check an actual value is super fast (index seek). I am therefore going to try running a series of separate SELECT statements until I find a match.

DECLARE @customerIdSubstring varchar(255) = 'SoseJost75G'
DECLARE @customerIdSubstringLength INT
DECLARE @results TABLE 
(
    CustomerId varchar(255)
)


DECLARE @FoundResults BIT = 0;

WHILE (@FoundResults = 0)
BEGIN 

    INSERT INTO @results (CustomerId)
    SELECT  CustomerId
    FROM    Customer cust
    WHERE   CustomerId = @customerIdSubstring 


    SELECT @FoundResults = CASE 
                               WHEN EXISTS(SELECT * FROM @results) THEN CAST(1 AS BIT)
                               ELSE CAST(0 AS BIT)
                           END

    SET @customerIdSubstringLength = LEN(@customerIdSubstring)

    -- We don't want to match on fewer than 3 chars.  (May not be correct at that point.)
    -- Stop once the 3-char value has been tried, so a 2-char value is never queried.
    IF (@customerIdSubstringLength <= 3)
        BREAK;

    SET @customerIdSubstring = LEFT(@customerIdSubstring, @customerIdSubstringLength - 1)
END 

SELECT CustomerId
FROM @results

It is possible that I will run the query many times, but in practice it will be 3-6 times per value. I think 3-6 index seeks are better than 1 seek and 1 scan.

This also has the added benefit of returning only the most "LIKE" rows. (Meaning that rows that have SanJos will not be returned if there are rows that have SanJost.)
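The same longest-prefix lookup can also be sketched as a single set-based statement instead of a loop. This is an alternative, not the approach described above, and it rests on a few assumptions: sys.all_objects is used only as a convenient source of row numbers, the comparison is made against ClientCustomerId (the column from the question), and prefixes shorter than 3 characters are ignored.

```sql
DECLARE @var VARCHAR(255) = 'SoseJost75G';

WITH Numbers AS
(
    -- One row per possible prefix length: 1 .. LEN(@var).
    SELECT TOP (LEN(@var))
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects
)
SELECT TOP (1) WITH TIES
       cust.CustomerId,
       cust.ClientCustomerId
FROM Numbers
JOIN Customer AS cust
  ON cust.ClientCustomerId = LEFT(@var, Numbers.n)  -- equality, so an index seek is possible
WHERE Numbers.n >= 3                                 -- assumed minimum prefix length
ORDER BY Numbers.n DESC;                             -- keep only the longest matching prefix
```

TOP (1) WITH TIES keeps every row that shares the longest matching prefix, which mirrors the "most LIKE rows" behaviour of the loop.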

If you can specify the minimum ClientCustomerId length, e.g. it can never be less than four characters, you can limit the results thus:

WHERE ClientCustomerId like left('SoseJost75G', 4) + '%'

Here an index can be used to get the matching records. Your criteria

AND ClientCustomerId <= 'SoseJost75G' AND 'SoseJost75G' LIKE ClientCustomerId + '%'

would then have to be looked up only in the records already found.

The complete query:

SELECT CustomerId
FROM Customer cust
WHERE ClientCustomerId LIKE LEFT('SoseJost75G', 4) + '%'
AND ClientCustomerId <= 'SoseJost75G' AND 'SoseJost75G' LIKE ClientCustomerId + '%';

BTW: Your criteria can also be written as

ClientCustomerId = LEFT('SoseJost75G', LEN(ClientCustomerId))

but I suppose that this isn't faster than your version.
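To see that the two forms of the criterion select the same rows, here is a small side-by-side check. It is a sketch on made-up sample values in a table variable (@sample is hypothetical), not on the real Customer table. Note one difference that the sample does not exercise: if a stored ID contains LIKE wildcard characters such as % or _, the LIKE form treats them as wildcards, while the LEFT form compares them literally.

```sql
DECLARE @sample TABLE (ClientCustomerId VARCHAR(100));

INSERT INTO @sample (ClientCustomerId)
VALUES ('SoseJost'), ('Sose'), ('SoseJost75G'), ('SoseJust'), ('Other');

SELECT ClientCustomerId,
       -- Original "reverse" LIKE criterion.
       CASE WHEN 'SoseJost75G' LIKE ClientCustomerId + '%'
            THEN 1 ELSE 0 END AS ReverseLike,
       -- Rewritten criterion using LEFT and LEN.
       CASE WHEN ClientCustomerId = LEFT('SoseJost75G', LEN(ClientCustomerId))
            THEN 1 ELSE 0 END AS LeftEquals
FROM @sample;
```

Both flag columns should agree on every row. Neither form lets the optimizer seek on ClientCustomerId, because the column appears inside an expression in both.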

I liked your approach, Vaccano. I just simplified it a bit, in case you are interested:

DECLARE @customerIdSubstring varchar(255) = 'SoseJost75G'
DECLARE @results TABLE 
(
    CustomerId varchar(255)
)

DECLARE @FoundResults BIT = 0
DECLARE @customerIdSubstringLength INT = LEN(@customerIdSubstring)

WHILE (@FoundResults = 0 AND @customerIdSubstringLength >= 3)
BEGIN 
    INSERT INTO @results
    SELECT  CustomerId
    FROM    Customer
    WHERE   CustomerId = @customerIdSubstring

    -- Make @FoundResults = 1 if there's at least one record
    SELECT TOP 1 @FoundResults = 1 FROM @results

    SET @customerIdSubstringLength = @customerIdSubstringLength - 1
    SET @customerIdSubstring = LEFT(@customerIdSubstring, @customerIdSubstringLength)
END

SELECT CustomerId
FROM @results

If you are totally sure that only one ID will match, you can simplify this even further by removing the results table, which would only have one row. I also removed the assignment to @customerIdSubstring in the loop:

DECLARE @customerIdSubstring varchar(255) = 'SoseJost75G'
DECLARE @customerIdFound varchar(255)

DECLARE @customerIdSubstringLength INT = LEN(@customerIdSubstring)

WHILE (@customerIdFound IS NULL AND @customerIdSubstringLength >= 3)
BEGIN 
    SELECT  @customerIdFound = CustomerId
    FROM    Customer
    WHERE   CustomerId = LEFT(@customerIdSubstring, @customerIdSubstringLength)

    SET @customerIdSubstringLength = @customerIdSubstringLength - 1
END

SELECT @customerIdFound

Basically, there is nothing wrong with your statement, because you could write it as a sargable query.

SARG = Search Argument

A sargable query allows the optimizer to use indexes, while for non-sargable queries the optimizer has to scan all rows in the table, even if indexes are available.

A LIKE with the % at the end is sargable. A LIKE with the % at the beginning is NOT sargable. Applying a function like LEFT([Column], 4) to the column in the WHERE clause makes the query non-sargable. At least the documentation on SARG says so.

[COLUMN] LIKE 'abc%' -> sargable
[COLUMN] LIKE '%abc' -> not sargable
LEFT([COLUMN], 4) LIKE 'abc%' -> not sargable
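The difference can be tried out on a throwaway table. The sketch below assumes SQL Server; the temp table #CustomerDemo, its index name and its sample rows are made up for illustration. On a table this small the optimizer may scan either way, so the comparison is only meaningful with a realistic row count and the actual execution plan.

```sql
CREATE TABLE #CustomerDemo
(
    CustomerId       INT IDENTITY(1, 1) PRIMARY KEY,
    ClientCustomerId VARCHAR(100) NOT NULL
);

CREATE INDEX IX_CustomerDemo_ClientCustomerId
    ON #CustomerDemo (ClientCustomerId);

INSERT INTO #CustomerDemo (ClientCustomerId)
VALUES ('SoseJost'), ('SoseJust'), ('Other');

-- Not sargable: the function wraps the column, so the index cannot be used for a seek.
SELECT CustomerId
FROM #CustomerDemo
WHERE LEFT(ClientCustomerId, 4) = 'Sose';

-- Sargable: the bare column is compared with a fixed-prefix pattern.
SELECT CustomerId
FROM #CustomerDemo
WHERE ClientCustomerId LIKE 'Sose%';

DROP TABLE #CustomerDemo;
```

Both statements return the same rows; only the way the index can be used differs.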

I think you should redesign the process before starting any query. Set up a proper ETL process for separating the ID and the suffix. Store that data in separate columns and configure the indexes as required. Then run your query on the transformed data.

This is all the more the preferred process because you do not know what data you will get.
