Better way to search string in list of 1 million strings in C#

I have a product Catalog object with up to 1 million products in it. The following code shows the Catalog class, with some test code that populates 1 million dummy products for testing:

using System;
using System.Collections.Generic;

public class Catalog
{
    // Random source for the dummy test data
    Random random = new Random();

    long Id { get; set; }
    public string Name { get; set; }
    public List<string> Products { get; set; }

    public Catalog()
    {
        Products = new List<string>();
        addProducts();
    }

    // Fills the catalog with 1 million random numeric strings (test data only)
    private void addProducts()
    {
        for (int i = 0; i < 1000000; i++)
        {
            Products.Add(random.Next(0, 100000000).ToString());
        }
    }
}

I have about 300-600 Catalog objects (with about 1 million products each) and need to check whether any 2 Catalogs have products in common. I only need to check; I don't need to find out which products are the same. The logic I am using is something like this:

static bool SearchDuplicateProducts(Catalog catalogA, Catalog catalogB)
{
    var found = false;

    foreach (string product in catalogA.Products)
    {
        if (catalogB.Products.Contains(product))
        {
            found = true;
            break;
        }
    }

    return found;
}

Of course, List<string> is not the fastest type to search, so I tried HashSet<string>. My tests showed about a 200% increase in search speed in the SearchDuplicateProducts() method when I used HashSet<> instead of List<> to hold Products.
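For reference, here is a minimal sketch of what the HashSet<string> variant could look like, assuming both product collections are already held as HashSet<string>. The class name DuplicateSearch and the method name HaveCommonProduct are illustrative and not part of the original code. It relies on HashSet<T>.Overlaps, which reports whether the set shares at least one element with another collection.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateSearch
{
    // Returns true if the two product sets share at least one element.
    // Hypothetical helper: the name and signature are assumptions for illustration.
    public static bool HaveCommonProduct(HashSet<string> productsA, HashSet<string> productsB)
    {
        // Pass the smaller set as the argument so that fewer elements
        // have to be looked up in the larger set.
        return productsA.Count <= productsB.Count
            ? productsB.Overlaps(productsA)
            : productsA.Overlaps(productsB);
    }
}
```

Each lookup in a hash set takes constant time on average, so one comparison costs roughly the size of the smaller set, instead of the product of both sizes as with List<string>.Contains.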

I am not sure, though, whether using HashSet<string> for the product list is the best or most efficient way to achieve what I am trying to do in SearchDuplicateProducts(). I want to know if there is any way (a third-party library, a database, a trie, or an algorithm) that can give me better results in terms of space and time complexity. If I have to choose between the two, I would prefer better time complexity.

I have checked similar questions:

  1. Best Way to compare 1 million List of object with another 1 million List of object in c#
  2. How to quickly search through a very large list of strings / records on a database
  3. C#: Memory-efficient search through 2 million objects without external dependencies

Thanks for your help.

Are you comparing each pair of the catalogs? That's insane! (With 600 catalogs, that is 179,700 pairwise comparisons.)

First, you should purge duplicates from every catalog individually; your random.Next will likely produce a few. Or just don't insert them in the first place.
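A rough sketch of both options, assuming products are plain strings as in the question. The class CatalogCleanup and its method names are made up for illustration. The first method purges duplicates from an existing list; the second never inserts them.

```csharp
using System;
using System.Collections.Generic;

public static class CatalogCleanup
{
    // Option 1: purge duplicates from an existing product list.
    // The HashSet constructor keeps one copy of each distinct string.
    public static HashSet<string> Deduplicate(IEnumerable<string> products)
    {
        return new HashSet<string>(products);
    }

    // Option 2: never insert duplicates while generating test data.
    // HashSet.Add ignores values that are already present, so the loop
    // runs until 'count' distinct products have been collected.
    public static HashSet<string> GenerateDistinct(int count, int maxExclusive, Random random)
    {
        if (count > maxExclusive)
        {
            // Guard: there are only 'maxExclusive' distinct values to draw from.
            throw new ArgumentException("count cannot exceed maxExclusive");
        }

        var products = new HashSet<string>();
        while (products.Count < count)
        {
            products.Add(random.Next(0, maxExclusive).ToString());
        }
        return products;
    }
}
```

For the test data in the question, the call would be something like CatalogCleanup.GenerateDistinct(1000000, 100000000, new Random()).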

Then you should run a single loop through all catalogs, trying to insert each object into a hash set; if it's already there, you have found a duplicate.
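The single loop could be sketched roughly as follows. The names CrossCatalogCheck and AnyProductShared are hypothetical, and each catalog is assumed to be available as a sequence of strings. The sketch deduplicates each catalog on the fly, so a product repeated inside one catalog is not mistaken for a match between two catalogs.

```csharp
using System;
using System.Collections.Generic;

public static class CrossCatalogCheck
{
    // Returns true if any product occurs in more than one catalog.
    public static bool AnyProductShared(IEnumerable<IEnumerable<string>> catalogs)
    {
        // Products seen in the catalogs processed so far.
        var seen = new HashSet<string>();

        foreach (IEnumerable<string> catalog in catalogs)
        {
            // Remove repeats within this catalog before comparing across catalogs.
            var distinct = new HashSet<string>(catalog);

            foreach (string product in distinct)
            {
                // Add returns false when the product is already in the set,
                // which means an earlier catalog contained it too.
                if (!seen.Add(product))
                {
                    return true;
                }
            }
        }

        return false;
    }
}
```

Note that this answers whether any two catalogs share a product in one pass over the data, rather than giving a result per pair. The shared set can grow to hold every distinct product across all catalogs, so it trades memory for time.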
