Better way to search string in list of 1 million strings in C#
I have a product Catalog object with up to 1 million products in it. The following code shows the Catalog class, along with some test code that populates it with 1 million dummy products for testing:
public class Catalog
{
    Random random = new Random();

    long Id { get; set; }
    public string Name { get; set; }
    public List<string> Products { get; set; }

    public Catalog()
    {
        Products = new List<string>();
        addProducts();
    }

    // Fill the catalog with 1 million random dummy product ids.
    private void addProducts()
    {
        for (int i = 0; i < 1000000; i++)
        {
            Products.Add(random.Next(0, 100000000).ToString());
        }
    }
}
I have about 300-600 Catalog objects (with about 1 million products each) and need to check whether any two Catalogs have products in common. I only need a yes/no answer; I don't want to find out which products are the same. The logic I am using looks like this:
static bool SearchDuplicateProducts(Catalog catalogA, Catalog catalogB)
{
    var found = false;
    foreach (string product in catalogA.Products)
    {
        if (catalogB.Products.Contains(product))
        {
            found = true;
            break;
        }
    }
    return found;
}
Of course List<string> is not the fastest type to search, so I tried HashSet<string>. My tests showed about a 200% increase in search speed in the SearchDuplicateProducts() method when I used a HashSet<> instead of a List<> to hold Products.
I am not sure, though, whether using HashSet<string> for the product list is the best or most efficient way to achieve what I am trying in SearchDuplicateProducts(). I want to know if there is any way (a third-party library, a database, a trie, or an algorithm) that can give me better results in terms of space and time complexity. If there is a choice between the two, then I would prefer better time complexity.
I have checked similar questions:
Thanks for your help.
Are you comparing each pair of the catalogs? That's insane!
First, you should purge duplicates from every catalog individually; your random.Next will likely produce a few. Or just don't insert them in the first place.
Then you should run a single loop through all catalogs, trying to insert each product into one hash set; if it's already there, you have found a duplicate.