
Improve performance of SQL Query with dynamic LIKE

I need to find people whose FirstName is a substring of somebody else's FirstName.

SELECT DISTINCT top 10 people.[Id], peopleName.[LastName], peopleName.[FirstName]       
    FROM [dbo].[people] people
    INNER JOIN [dbo].[people_NAME] peopleName on peopleName.[Id] = people.[Id]
    WHERE EXISTS (SELECT * 
                        FROM [dbo].[people_NAME] peopleName2 
                        WHERE peopleName2.[Id] != people.[id] 
                            AND peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%')

It is so slow! I know it's because of the "'%' + peopleName.[FirstName] + '%'", because if I replace it with a hardcoded value like '%G%', it runs instantly.

With my dynamic LIKE, my TOP 10 takes more than 10 seconds. I want to be able to run it on a much bigger database.

What can I do?

This is a hard problem. I don't think a full text index will help, because you want to compare two columns.

That doesn't leave good options. One possibility is to implement ngrams. These are sequences of characters (say, 3 in a row) that come from a string. From my first name, you would have:

gor
ord
rdo
don

You can then match these directly (with equality, which can use an index) against the ngrams of the other column. After that, you still have to check whether the full name from one column is actually contained in the other, but the ngrams should significantly reduce the search space.
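To make the idea concrete, here is a minimal Python sketch of the two steps described above: a cheap ngram comparison first, then the exact substring check. The function names (`trigrams`, `contained_in_other`) and the sample names are illustrative assumptions, not part of the original answer, and the loop is still pairwise; the real gain would come from looking the ngrams up in an indexed table.

```python
def trigrams(name):
    """Return the set of 3-character substrings of a lowercased name."""
    s = name.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def contained_in_other(names):
    """names: dict of id -> name. Ids whose name is a substring of another's."""
    grams = {pid: trigrams(n) for pid, n in names.items()}
    hits = []
    for pid, name in names.items():
        needle = name.lower()
        for other, hay in names.items():
            if other == pid:
                continue
            # Cheap filter: every trigram of a substring must also occur in
            # the longer name. Names shorter than 3 characters have no
            # trigrams, so they pass the filter and rely on the exact check.
            if grams[pid] <= grams[other] and needle in hay.lower():
                hits.append(pid)
                break
    return hits


# Hypothetical sample data
people = {1: "Gordon", 2: "Don", 3: "Gordon-Lee", 4: "Ann"}
print(sorted(trigrams("Gordon")))   # ['don', 'gor', 'ord', 'rdo']
print(contained_in_other(people))   # [1, 2]
```

The filter can only rule pairs out, never confirm them, which is why the exact `in` check stays in place.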

Also, implementing ngrams requires work. One method uses a trigger that calculates the ngrams for each name and then inserts them into an ngram table.
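As a rough illustration of what such an ngram table buys you, the following Python sketch keeps an in-memory mapping from each ngram to the ids of the names that contain it. The class `NgramIndex` and its methods are hypothetical stand-ins: `add` plays the role of the insert trigger, and the dictionary plays the role of the ngram table. Updates and deletes are not handled.

```python
from collections import defaultdict


class NgramIndex:
    """In-memory stand-in for an ngram table keyed by (ngram, person id)."""

    def __init__(self, n=3):
        self.n = n
        self.names = {}                    # person id -> lowercased name
        self.postings = defaultdict(set)   # ngram -> ids of names containing it

    def _grams(self, s):
        return {s[i:i + self.n] for i in range(len(s) - self.n + 1)}

    def add(self, pid, name):
        """Plays the role of the insert trigger: store the name and its ngrams."""
        s = name.lower()
        self.names[pid] = s
        for g in self._grams(s):
            self.postings[g].add(pid)

    def containing(self, pid):
        """Ids of other people whose name contains the name of `pid`."""
        needle = self.names[pid]
        grams = self._grams(needle)
        if grams:
            # Only names sharing every ngram of the needle can contain it.
            candidates = set.intersection(*(self.postings[g] for g in grams))
        else:
            # Name shorter than n: the index cannot help, fall back to a scan.
            candidates = set(self.names)
        return sorted(c for c in candidates
                      if c != pid and needle in self.names[c])


# Hypothetical sample data
index = NgramIndex()
for pid, name in [(1, "Gordon"), (2, "Don"), (3, "Gordon-Lee"), (4, "Al")]:
    index.add(pid, name)

print(index.containing(2))  # [1, 3]
print(index.containing(1))  # [3]
print(index.containing(4))  # []
```

Note that names shorter than the ngram length get no benefit from the index, so a real implementation would need a separate strategy for them.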

I'm not sure if all this work is worth the effort to solve your problem. But it is possible to speed up the search.

Take a look at my answer about using the LIKE operator here

It can be quite performant if you use some tricks.

You can gain a lot of speed if you play with the collation. Try this:

SELECT DISTINCT TOP 10 p.[Id], n.[LastName], n.[FirstName]       
FROM [dbo].[people] p
INNER JOIN [dbo].[people_NAME] n on n.[Id] = p.[Id]
WHERE EXISTS (
    SELECT 'x' x
    FROM [dbo].[people_NAME] n2
    WHERE n2.[Id] != p.[id]     
    AND 
        lower(n2.[FirstName]) collate latin1_general_bin 
        LIKE 
        '%' + lower(n.[FirstName]) + '%' collate latin1_general_bin
)

As you can see, we are using binary comparison instead of string comparison, and this is much more performant.

Pay attention: you are working with people's names, so you can have issues with special Unicode characters, unusual accents, etc.

Normally the EXISTS clause is better than INNER JOIN, but you are also using a DISTINCT, which is a GROUP BY on all columns, so why not use that?

You can switch to INNER JOIN and use GROUP BY instead of DISTINCT, so that testing COUNT(*)>1 will be (very slightly) more performant than testing WHERE n2.[Id] != p.[id], especially if your TOP clause is extracting many rows.

Try this:

SELECT TOP 10 p.[Id], n.[LastName], n.[FirstName]
FROM [dbo].[people] p
INNER JOIN [dbo].[people_NAME] n on n.[Id] = p.[Id]
INNER JOIN [dbo].[people_NAME] n2 on 
    lower(n2.[FirstName]) collate latin1_general_bin 
    LIKE 
    '%' + lower(n.[FirstName]) + '%' collate latin1_general_bin
GROUP BY p.[Id], n.[LastName], n.[FirstName]
HAVING COUNT(*)>1

Here we are also matching each name against itself, so we will find at least one match for every name. But we need only the names that match other names, so we keep only rows with a match count greater than one (COUNT(*)=1 means that the name matches only itself).
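The counting logic can be checked on a small example. This Python sketch mirrors the HAVING COUNT(*)>1 test; the function name and sample rows are made up for illustration, and it assumes one name row per person (with several rows per person, a name could match itself more than once).

```python
def names_matching_others(rows):
    """rows: list of (id, first_name). Keep ids whose match count exceeds 1."""
    result = []
    for pid, name in rows:
        needle = name.lower()
        # Count every row whose name contains this one, including the row itself.
        count = sum(1 for _, other in rows if needle in other.lower())
        if count > 1:          # mirrors HAVING COUNT(*) > 1
            result.append(pid)
    return result


# Hypothetical sample data
rows = [(1, "Ann"), (2, "Joanne"), (3, "Bob"), (4, "Marianne")]
print(names_matching_others(rows))  # [1]
```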

EDIT: I ran all tests against a table of 100000 random names and found that, in this scenario, normal usage of the LIKE operator is about three times slower than binary comparison.

Have you tried a JOIN instead of a correlated subquery?

Since it cannot use an index, its performance won't be optimal, but it should be a bit better than a correlated subquery.

SELECT DISTINCT top 10 people.[Id], peopleName.[LastName], peopleName.[FirstName]       
FROM [dbo].[people] people
     INNER JOIN [dbo].[people_NAME] peopleName on peopleName.[Id] = people.[Id]
     INNER JOIN [dbo].[people_NAME] peopleName2 on peopleName2.[Id] <> people.[id] AND
                                                   peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%'

You can do this:

WITH CTE AS
(
    SELECT TOP 10 peopleName.[Id], peopleName.[LastName], peopleName.[FirstName]
    FROM [dbo].[people_NAME] peopleName
    WHERE EXISTS (SELECT 1
                        FROM [dbo].[people_NAME] peopleName2
                        WHERE peopleName2.[Id] != peopleName.[Id]
                            AND peopleName2.[FirstName] LIKE '%' + peopleName.[FirstName] + '%')
    ORDER BY peopleName.[Id]
)
-- join CTE with the people table here, if that is required at all
SELECT * FROM CTE

If joining with people is not required, then there is no need for the CTE.
