
LINQ Search nvarchar(MAX) column extremely slow using .Contains()

I have a .NET Core API and I am trying to search 4.4 million records using .Contains(). This is obviously extremely slow - 26 seconds. I am just querying one column, which is the name of the record. How is this problem generally solved when dealing with millions of records?

I have never worked with millions of records before so apart from the obvious altering of the .Select and .Take, I haven't tried anything too drastic. I have spent many hours on this though.

The other filters included in the .Where are only used when a user chooses to use them on the front end - the real problem is just searching by CompanyName.

Note: I am using .ToArray() when returning the results.

I have indexes in the database but cannot add one for CompanyName as it is nvarchar(MAX).

I have also looked at the execution plan and it doesn't really show anything out of the ordinary.

query = _context.Companies.Where(
    c => c.CompanyName.Contains(paging.SearchCriteria.companyNameFilter.ToUpper())
         && c.CompanyNumber.StartsWith(
                string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter)
                ? paging.SearchCriteria.companyNumberFilter.ToUpper()
                : ""
            )
         && c.IncorporationDate > paging.SearchCriteria.companyIncorperatedGreaterFilter
         && c.IncorporationDate < paging.SearchCriteria.companyIncorperatedLessThanFilter
    )
    .Select(x => new Company() {
                    CompanyName = x.CompanyName,
                    IncorporationDate = x.IncorporationDate,
                    CompanyNumber = x.CompanyNumber
                }
    )
    .Take(10);

I expect the query to take around 1 / 2 seconds, as when I execute a LIKE query in SSMS it takes about 1 / 2 seconds.

Here is the code being submitted to the DB:

Microsoft.EntityFrameworkCore.Database.Command: Information: Executing DbCommand [Parameters=[@__p_4='?' (DbType = Int32), @__ToUpper_0='?' (Size = 4000), @__p_1='?' (Size = 4000), @__paging_SearchCriteria_companyIncorperatedGreaterFilter_2='?' (DbType = DateTime2), @__paging_SearchCriteria_companyIncorperatedLessThanFilter_3='?' (DbType = DateTime2), @__p_5='?' (DbType = Int32)], CommandType='Text', CommandTimeout='30']
SELECT [t].[CompanyName], [t].[IncorporationDate], [t].[CompanyNumber]
FROM (
    SELECT TOP(@__p_4) [c].[CompanyName], [c].[IncorporationDate], [c].[CompanyNumber], [c].[ID]
    FROM [Companies] AS [c]
    WHERE (((((@__ToUpper_0 = N'') AND @__ToUpper_0 IS NOT NULL) OR (CHARINDEX(@__ToUpper_0, [c].[CompanyName]) > 0)) AND (((@__p_1 = N'') AND @__p_1 IS NOT NULL) OR ([c].[CompanyNumber] IS NOT NULL AND (@__p_1 IS NOT NULL AND (([c].[CompanyNumber] LIKE [c].[CompanyNumber] + N'%') AND (((LEFT([c].[CompanyNumber], LEN(@__p_1)) = @__p_1) AND (LEFT([c].[CompanyNumber], LEN(@__p_1)) IS NOT NULL AND @__p_1 IS NOT NULL)) OR (LEFT([c].[CompanyNumber], LEN(@__p_1)) IS NULL AND @__p_1 IS NULL))))))) AND ([c].[IncorporationDate] > @__paging_SearchCriteria_companyIncorperatedGreaterFilter_2)) AND ([c].[IncorporationDate] < @__paging_SearchCriteria_companyIncorperatedLessThanFilter_3)
) AS [t]
ORDER BY [t].[IncorporationDate] DESC
OFFSET @__p_5 ROWS FETCH NEXT @__p_4 ROWS ONLY

SOLVED! With the help of both answers!

In the end, as suggested, I tried full-text searching, which was lightning fast but compromised the accuracy of the search results. In order to filter those results more accurately, I used .Contains on the query after applying the full-text search.

Here is the code that works. Hopefully this helps others.

//query = _context.Companies
//.Where(c => c.CompanyName.StartsWith(paging.SearchCriteria.companyNameFilter.ToUpper())
//&& c.CompanyNumber.StartsWith(string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter) ? paging.SearchCriteria.companyNumberFilter.ToUpper() : "")
//&& c.IncorporationDate > paging.SearchCriteria.companyIncorperatedGreaterFilter && c.IncorporationDate < paging.SearchCriteria.companyIncorperatedLessThanFilter)
//.Select(x => new Company() { CompanyName = x.CompanyName, IncorporationDate = x.IncorporationDate, CompanyNumber = x.CompanyNumber }).Take(10);

            query = _context.Companies.Where(c => EF.Functions.FreeText(c.CompanyName, paging.SearchCriteria.companyNameFilter.ToUpper()));

            query = query.Where(x => x.CompanyName.Contains(paging.SearchCriteria.companyNameFilter.ToUpper()));

(I temporarily excluded the other filters for simplicity)

Welcome to Stack Overflow. It looks like you are suffering from at least one of these three problems in your code and your architecture.

First: indexing

You've mentioned that this column cannot be indexed, but SQL Server does at the very least support full-text indexing on it.

.Contains

This method isn't exactly suitable for the size of operation you're performing. If possible, perhaps as a last resort, consider moving to a parameterized query. For now, however, it looks like you want to keep your business logic in the .NET code rather than spreading it into SQL, and that's a worthy plan.

c.IncorporationDate

Date comparison can be a little costly in SQL Server. Once you're dealing with so many millions of rows you might get a lot of performance benefit from correctly partitioned tables and indexes.

Consider whether or not these rows can change at all. Something named IncorporationDate sounds like it definitely should not be changed. I suspect you may want to leverage that after reading the rest of these.

When you run the query in SSMS, it's probably cached for subsequent calls. The original query probably took similar time as the EF query. That said, there are disadvantages to parametrised queries - while you can better reuse execution plans in a parametrised query, this also means that the execution plan isn't necessarily the best for the actual query you're trying to run right now.

For example, if you specify a CompanyNumber (which is easy to find in an index due to the StartsWith), you can filter the data first by CompanyNumber, thus making the name search trivial (I assume CompanyNumber is unique, so either you get 0 records, or you get the one you get by CompanyNumber). This might not be possible for the parametrised query, if its execution plan was optimized for looking up by name.

But in the end, Contains is a performance killer. It needs to read every single byte of data in your table's CompanyName field, which usually means it has to read every single row and process much of its data. Searching by a substring looks deceptively simple, but always carries heavy penalties - its complexity is linear with respect to data size.
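
To make the difference concrete, here is a small in-memory analogy (a sketch, not what SQL Server does internally). The sample names and the NameSearch helper are made up for illustration: a sorted array stands in for an index, a prefix search can jump straight to the matching range, and a substring search has to look at every value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for an indexed name column.
string[] names = { "ZEBRA COMPUTERS", "COMPUTING CO", "ACME LTD", "COMPUTER INC" };
Array.Sort(names, StringComparer.Ordinal); // an index keeps its keys sorted

Console.WriteLine(string.Join("; ", NameSearch.StartsWithSorted(names, "COMP")));
Console.WriteLine(string.Join("; ", NameSearch.ContainsScan(names, "COMP")));

static class NameSearch
{
    // Prefix search: binary-search to the first key >= prefix, then read
    // forward only while the keys still start with the prefix.
    public static List<string> StartsWithSorted(string[] sorted, string prefix)
    {
        int lo = 0, hi = sorted.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sorted[mid], prefix) < 0) lo = mid + 1;
            else hi = mid;
        }

        var result = new List<string>();
        for (int i = lo; i < sorted.Length && sorted[i].StartsWith(prefix, StringComparison.Ordinal); i++)
            result.Add(sorted[i]);
        return result;
    }

    // Substring search: the sort order does not help, so every value is inspected.
    public static List<string> ContainsScan(IEnumerable<string> values, string term) =>
        values.Where(v => v.Contains(term, StringComparison.Ordinal)).ToList();
}
```

With this data the prefix search returns the two names beginning with COMP after a handful of comparisons, while the scan also finds ZEBRA COMPUTERS but only by touching every entry.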

One option is to find a way to avoid the Contains. Users often ask for features they don't actually need. StartsWith might work just as well for most of the cases. But that's a business decision, of course.

Another option would be finding a way to reduce the query as much as possible before you apply the Contains filter - if you only allow searching for company name with other filters that narrow the search down, you can save the DB server a lot of work. This may be tricky, and can sometimes run into the execution plan collision issue - you might want to add some way to avoid having the same execution plan for two queries that are wildly different; an easy way in EF would be to build the query up dynamically, rather than trying for one expression:

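// Declare as IQueryable<Company> so the filtered queries can be assigned back
// (with var, the variable would take the type of _context.Companies, i.e. the DbSet).
IQueryable<Company> query = _context.Companies;
// Each filter is added only when the user actually supplied a value.
if (!string.IsNullOrEmpty(paging.SearchCriteria.companyNameFilter))
  query = query.Where(c => c.CompanyName.Contains(paging.SearchCriteria.companyNameFilter));
if (!string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter))
  query = query.Where(c => c.CompanyNumber.StartsWith(paging.SearchCriteria.companyNumberFilter));

// etc. for the rest of the query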

This means that you actually have multiple parametrised queries that can each have their own execution plan, more in line with what the query actually does. For some extreme cases, it might also be worthwhile to completely prevent execution plan caching (this is often useful in reports).
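
For completeness, here is a self-contained sketch of the same idea that runs against an in-memory list instead of a database. The Company record, the SearchCriteria class and the CompanyQueries.Build helper are assumptions made for illustration (loosely modelled on the names in the question); with EF Core, the same method would be given _context.Companies as its source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for _context.Companies (hypothetical sample rows).
var companies = new List<Company>
{
    new("ACME LTD", "00123", new DateTime(2001, 5, 1)),
    new("COMPUTER INC", "00456", new DateTime(2015, 3, 9)),
    new("ZEBRA COMPUTERS", "10789", new DateTime(2019, 8, 20)),
}.AsQueryable();

var criteria = new SearchCriteria
{
    CompanyNumberFilter = "00",
    IncorporatedAfter = new DateTime(2010, 1, 1),
};

foreach (var c in CompanyQueries.Build(companies, criteria).Take(10))
    Console.WriteLine($"{c.CompanyNumber} {c.CompanyName}");

record Company(string CompanyName, string CompanyNumber, DateTime IncorporationDate);

// Hypothetical criteria object; unset filters stay null.
class SearchCriteria
{
    public string CompanyNameFilter { get; set; }
    public string CompanyNumberFilter { get; set; }
    public DateTime? IncorporatedAfter { get; set; }
    public DateTime? IncorporatedBefore { get; set; }
}

static class CompanyQueries
{
    // Adds a Where clause only for the filters that were supplied, so each
    // combination of filters becomes its own, smaller query.
    public static IQueryable<Company> Build(IQueryable<Company> source, SearchCriteria s)
    {
        var query = source;
        if (!string.IsNullOrEmpty(s.CompanyNumberFilter))
            query = query.Where(c => c.CompanyNumber.StartsWith(s.CompanyNumberFilter));
        if (!string.IsNullOrEmpty(s.CompanyNameFilter))
            query = query.Where(c => c.CompanyName.Contains(s.CompanyNameFilter));
        if (s.IncorporatedAfter.HasValue)
            query = query.Where(c => c.IncorporationDate > s.IncorporatedAfter.Value);
        if (s.IncorporatedBefore.HasValue)
            query = query.Where(c => c.IncorporationDate < s.IncorporatedBefore.Value);
        return query.OrderByDescending(c => c.IncorporationDate);
    }
}
```

Because a filter that was not supplied never appears in the expression, a provider such as EF Core has no reason to emit the "parameter is empty OR ..." branches for it, and each filter combination ends up as a distinct SQL text.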

The final option is using full-text search. You can find plenty of tutorials on how to make this work. This works essentially by splitting the unformatted string data into individual words or phrases, and indexing those. This means that a search for "hello world" doesn't necessarily return all the records that have "hello world" in the name, and it might also return records that have something other than "hello world" in the name. Think Google Search rather than Contains. This can often be a great method for human-written text, but it can be very confusing for users who don't understand why you'd return search results that are completely different from what they were searching for. It also often doesn't work well if you need to do partial searches (e.g. searching for "Computer" might return "Computer, Inc.", but searching for "Comp" might return nothing).
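
The word-splitting behaviour can be illustrated with a toy word index. This is only a sketch of the general idea, not SQL Server's actual full-text engine (which also handles stemming, noise words, ranking and more); the WordIndex class, its separator list and the sample names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] names = { "Computer, Inc.", "Hello World Ltd", "World of Hello Kitty" };
var index = new WordIndex(names);

Console.WriteLine(index.Search("Computer").Count);    // a whole word that was indexed
Console.WriteLine(index.Search("Comp").Count);        // a fragment that is not a word
Console.WriteLine(index.Search("hello world").Count); // both words, any order

class WordIndex
{
    private static readonly char[] Separators = { ' ', ',', '.', ';', '-' };

    // Maps each word to the set of names that contain it.
    private readonly Dictionary<string, HashSet<string>> _postings =
        new(StringComparer.OrdinalIgnoreCase);

    public WordIndex(IEnumerable<string> names)
    {
        foreach (var name in names)
            foreach (var word in Tokenize(name))
            {
                if (!_postings.TryGetValue(word, out var set))
                    _postings[word] = set = new HashSet<string>();
                set.Add(name);
            }
    }

    // Returns the names that contain every word of the query,
    // regardless of word order or position.
    public List<string> Search(string query)
    {
        HashSet<string> hits = null;
        foreach (var word in Tokenize(query))
        {
            if (!_postings.TryGetValue(word, out var set)) return new List<string>();
            if (hits == null) hits = new HashSet<string>(set);
            else hits.IntersectWith(set);
        }
        return hits?.ToList() ?? new List<string>();
    }

    private static string[] Tokenize(string text) =>
        text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
}
```

In this toy, "Computer" finds "Computer, Inc." but "Comp" finds nothing, and "hello world" also matches "World of Hello Kitty", which does not contain that phrase. A real full-text engine differs in detail, but the mismatch with Contains has the same cause: lookups are per word, not per character sequence.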

The first option is likely the fastest, and closest to what the users would expect. It has the weakness that it can't search in the middle, though. The second option is the most correct, and might make your query substantially faster, especially in the most common cases with good statistics. The third option is probably about as fast as the first one, but can be tricky to set up properly, and can be confusing for your users. It does also provide you with more powerful ways to query the text data (e.g. using wildcards).

