
What is the fastest way to search a List<T> across multiple properties?

I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those steps, the code generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to), but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.

Effectively, the code does this:

var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) && 
                       r.year == record.year && 
                       r.period == record.period).FirstOrDefault();

cost is a local List type. If I were searching on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.

Obviously, this is REALLY slow.

I ran across the open source library I4O, which can build indexes; however, it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important, since a lot of the original queries take advantage of the fact that a search for "A" would find a match in "ABC").

Are there any other projects (open source or commercial) that do this sort of thing?

EDIT:

I did some searching based on the feedback and found Power Collections, which supports dictionaries whose keys aren't unique.

I tested ToLookup(), which worked great. It's still not quite as fast as the original code, but it's at least acceptable: it's down from 45 seconds to 3-4 seconds. I'll take a look at the trie structure for the other lookups.

Thanks.

Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:

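// Immutable composite key. Equals and GetHashCode are implemented explicitly
// so that dictionary lookups compare by value without the default struct path.
readonly struct Key : System.IEquatable<Key>
{
    public readonly int year;
    public readonly int period;
    public Key(int year, int period) { this.year = year; this.period = period; }
    public bool Equals(Key other) => year == other.year && period == other.period;
    public override bool Equals(object obj) => obj is Key other && Equals(other);
    public override int GetHashCode() => (year * 397) ^ period;
}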

and then package your data into an IDictionary<Key, ICollection<T>> or similar, where T is the element type of your current list. This way each lookup only has to consider the rows that share its year and period, which cuts down heavily on the number of rows examined in each iteration.
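A minimal sketch of that grouping, under a few assumptions: CostRecord is a hypothetical class carrying only the ryp, year and period members from the question, a value tuple stands in for a hand-written key struct (tuples already provide value equality and hashing), and the prefix test uses ordinal comparison, whereas the question's StartsWith call uses the culture-sensitive default.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record type; only the three members used in the question are modelled.
public sealed class CostRecord
{
    public string ryp = "";
    public int year;
    public int period;
}

public static class CostIndex
{
    // Groups records by (year, period), preserving list order inside each bucket.
    public static Dictionary<(int year, int period), List<CostRecord>> Build(IEnumerable<CostRecord> cost)
    {
        var index = new Dictionary<(int year, int period), List<CostRecord>>();
        foreach (var r in cost)
        {
            var key = (r.year, r.period);
            if (!index.TryGetValue(key, out var bucket))
            {
                bucket = new List<CostRecord>();
                index.Add(key, bucket);
            }
            bucket.Add(r);
        }
        return index;
    }

    // Scans only the bucket for (year, period); returns null when nothing matches.
    public static CostRecord Find(
        Dictionary<(int year, int period), List<CostRecord>> index,
        string form, int year, int period)
    {
        if (!index.TryGetValue((year, period), out var bucket)) return null;
        string prefix = form.TrimEnd(); // trimmed once, not once per row
        foreach (var r in bucket)
        {
            // Ordinal comparison is an assumption made for this sketch.
            if (r.ryp.StartsWith(prefix, StringComparison.Ordinal)) return r;
        }
        return null;
    }
}
```

The index would be built once with CostIndex.Build(cost) and then queried with CostIndex.Find(index, record.form, record.year, record.period) for each record, so the prefix scan only touches rows in the matching bucket.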

The next step would be to use as the value type not an ICollection<T> but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.
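As a rough illustration of the idea, here is a small trie sketch. It is not the implementation the answer linked to; PrefixTrie and its members are hypothetical names. Each node remembers the first value whose key passed through it, so a prefix query costs one step per character of the prefix and returns the earliest-added match, mirroring FirstOrDefault. Matching is by exact character (ordinal), which is an assumption.

```csharp
using System.Collections.Generic;

// Minimal character trie. Each node remembers the first value whose key
// passed through it, so a prefix query is a walk of prefix.Length steps.
public sealed class PrefixTrie<T>
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool HasValue;
        public T First;
    }

    private readonly Node _root = new Node();

    // Inserts value under key; earlier insertions win at every shared node.
    public void Add(string key, T value)
    {
        var node = _root;
        Remember(node, value);
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out var next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
            Remember(node, value);
        }
    }

    // Finds the earliest-added value whose key starts with prefix.
    public bool TryFindFirst(string prefix, out T value)
    {
        var node = _root;
        foreach (char c in prefix)
        {
            if (!node.Children.TryGetValue(c, out node))
            {
                value = default(T);
                return false;
            }
        }
        value = node.First;
        return node.HasValue;
    }

    private static void Remember(Node node, T value)
    {
        if (node.HasValue) return;
        node.HasValue = true;
        node.First = value;
    }
}
```

In the arrangement described above, each (year, period) key would map to one such trie, filled by calling Add(r.ryp, r) for every row in list order and queried with TryFindFirst(record.form.TrimEnd(), out var match). The trade-off is memory: every key contributes one node per character.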

Finally, a free micro-optimization would be to hoist the TrimEnd call out of the predicate, so that it runs once per lookup rather than once per row examined.

Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.

Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you assign to a local match variable).

If you want to quickly match items together, use .ToLookup.

var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});

foreach(var group in lookup)
{
  // do something with items in group.
}

Your StartsWith criterion is troublesome for key-based matching. One way to approach that problem is to leave it out when generating keys, and apply it afterwards within the matching group.

var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));

Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't, and created the lookup each time, it would still be faster than n^2.
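To make the "create once, reuse" point concrete, here is a sketch of the whole matching pass. Cost, Record, and Matcher.MatchAll are hypothetical names standing in for the question's types, and a value tuple is used as the key in place of the anonymous type so the lookup can be built inside a reusable method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the question's two record types.
public sealed class Cost { public string ryp = ""; public int year; public int period; }
public sealed class Record { public string form = ""; public int year; public int period; }

public static class Matcher
{
    // Returns one entry per input record: the first matching cost row, or null.
    public static List<Cost> MatchAll(IEnumerable<Record> records, IEnumerable<Cost> cost)
    {
        // Built once, outside the loop: a single pass over cost.
        var lookup = cost.ToLookup(r => (r.year, r.period));

        var matches = new List<Cost>();
        foreach (var record in records)
        {
            string lookForThis = record.form.TrimEnd();
            // Indexing a lookup with a missing key yields an empty sequence,
            // so FirstOrDefault simply returns null in that case.
            matches.Add(lookup[(record.year, record.period)]
                .FirstOrDefault(r => r.ryp.StartsWith(lookForThis)));
        }
        return matches;
    }
}
```

With this shape, the cost list is walked once to build the lookup, and each record then scans only the rows that share its year and period.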
