Better way to search string in list of 1 million strings in C#
I have a product Catalog object with up to 1 million products in it. The following code shows the Catalog class, along with some test code that populates it with 1 million dummy products for testing:
public class Catalog
{
    Random random = new Random();

    long Id { get; set; }
    public string Name { get; set; }
    public List<string> Products { get; set; }

    public Catalog()
    {
        Products = new List<string>();
        addProducts();
    }

    // Fill the catalog with 1 million random dummy product ids.
    private void addProducts()
    {
        for (int i = 0; i < 1000000; i++)
        {
            Products.Add(random.Next(0, 100000000).ToString());
        }
    }
}
I have about 300-600 Catalog objects (with about 1 million products each) and need to check whether any two Catalogs have products in common. I only need a yes/no answer; I don't want to find out which products are the same. The logic I am using looks like this:
static bool SearchDuplicateProducts(Catalog catalogA, Catalog catalogB)
{
    var found = false;
    foreach (string product in catalogA.Products)
    {
        if (catalogB.Products.Contains(product))
        {
            found = true;
            break;
        }
    }
    return found;
}
Of course List<string> is not the fastest type to search, so I tried HashSet<string>. My tests showed about a 200% increase in search speed in the SearchDuplicateProducts() method when I used a HashSet<> instead of a List<> to hold Products.
I am not sure, though, whether using HashSet<string> for the product list is the best or most efficient way to achieve what I am trying in SearchDuplicateProducts(). I want to know if there is any way (a third-party library, a database, a trie, or an algorithm) that can give me better results in terms of space and time complexity. If there is a choice between the two, then I would prefer better time complexity.
I have checked similar questions:
Thanks for your help.
Are you comparing each pair of the catalogs? That's insane!
First, you should purge duplicates from every catalog individually; your random.Next will likely produce a few. Or just don't insert them in the first place.
Then you should run a single loop through all catalogs, trying to insert each product into one hash set; if it's already there, you have found a duplicate.