
C# comparing list of IDs

I have a List<Keyword>, where the Keyword class is:

public string keyword;
public List<int> ids;
public int hidden;
public int live;
public bool worked;

Each Keyword has its own keyword string and a set of 20 ids; live defaults to 1 and hidden to 0.

I just need to iterate over the whole main list and invalidate keywords that share too many ids. Comparing every pair, if the second keyword has more than 6 ids in common with the first one, its hidden is set to 1 and live to 0.

The algorithm is very basic but it takes too long when the main list has many elements.

I'm trying to figure out whether there is any method I could use to increase the speed.

The basic algorithm I use is:

foreach (Keyword main_keyword in lista_de_keywords_live)
{
    if (main_keyword.worked) {
        continue;
    }
    foreach (Keyword keyword_to_compare in lista_de_keywords_live)
    {
        // skip keywords already marked and skip comparing a keyword with itself
        if (keyword_to_compare.worked || ReferenceEquals(keyword_to_compare, main_keyword)) continue;

        int n_ids_same = 0;
        foreach (int id in main_keyword.ids)
        {
            // linear search over the other keyword's ids
            if (keyword_to_compare.ids.IndexOf(id) >= 0)
            {
                if (++n_ids_same >= 6) break;
            }
        }

        if (n_ids_same >= 6)
        {
            keyword_to_compare.hidden = 1;
            keyword_to_compare.live   = 0;
            keyword_to_compare.worked = true;
        }
    }
}

The code below is an example of how you would use a HashSet for your problem. However, I would not recommend using it in this scenario: as noted in the GetHashCode comment below, every keyword gets the same hash, so you still end up with O(n^2) comparisons. On the other hand, the idea of sorting the ids to make the comparison faster still applies. Run it in a Console project to try it out.

Notice that once I'm done adding new ids to a keyword, I sort them. This makes the comparison faster later on.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace KeywordExample
{

    public class Keyword
    {
        public List<int> ids;
        public int hidden;
        public int live;
        public bool worked;

        public Keyword()
        {
            ids = new List<int>();
            hidden = 0;
            live = 1;
            worked = false;
        }

        public override string ToString()
        {
            StringBuilder s = new StringBuilder();
            if (ids.Count > 0)
            {
                s.Append(ids[0]);
                for (int i = 1; i < ids.Count; i++)
                {
                    s.Append(',' + ids[i].ToString());
                }
            }
            return s.ToString();
        }

    }

    public class KeywordComparer : EqualityComparer<Keyword>
    {
        public override bool Equals(Keyword k1, Keyword k2)
        {
            int equals = 0;
            int i = 0;
            int j = 0;

            //based on sorted ids
            while (i < k1.ids.Count && j < k2.ids.Count)
            {
                if (k1.ids[i] < k2.ids[j])
                {
                    i++;
                }
                else if (k1.ids[i] > k2.ids[j])
                {
                    j++;
                }
                else
                {
                    equals++;
                    i++;
                    j++;
                }
            }

            return equals >= 6;
        }
        public override int GetHashCode(Keyword keyword)
        {
            return 0;//notice that using the same hash for all keywords gives you an O(n^2) time complexity though.
        }
    }


    class Program
    {

        static void Main(string[] args)
        {
            List<Keyword> listOfKeywordsLive = new List<Keyword>();
            //add some values
            Random random = new Random();
            int n = 10;
            int sizeOfMaxId = 20;
            for (int i = 0; i < n; i++)
            {
                var newKeyword = new Keyword();
                for (int j = 0; j < 20; j++)
                {
                    newKeyword.ids.Add(random.Next(sizeOfMaxId) + 1);
                }
                newKeyword.ids.Sort(); //sorting the ids
                listOfKeywordsLive.Add(newKeyword);
            }

            //solution here
            HashSet<Keyword> set = new HashSet<Keyword>(new KeywordComparer());
            set.Add(listOfKeywordsLive[0]);
            for (int i = 1; i < listOfKeywordsLive.Count; i++)
            {
                Keyword keywordToCompare = listOfKeywordsLive[i];
                if (!set.Add(keywordToCompare))
                {
                    keywordToCompare.hidden = 1;
                    keywordToCompare.live = 0;
                    keywordToCompare.worked = true;
                }
            }

            //print all keywords to check
            Console.WriteLine(set.Count + "/" + n + " inserted");
            foreach (var keyword in set)
            {
                Console.WriteLine(keyword);
            }

        }

    }
}

The obvious source of inefficiency is the way you calculate the intersection of two lists (of ids). The algorithm is O(n^2). This is, by the way, the problem that relational databases solve for every join, and your approach would be called a nested loop join. The main efficient strategies are the hash join and the merge join. For your scenario the latter approach may be better, I guess, but you can also try HashSets if you like.
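For illustration only, here is a minimal sketch of the hash-join idea (the helper name is mine, not from the answer): build a HashSet<int> from one keyword's ids once, then probe it with the other keyword's ids, so each lookup is O(1) instead of a linear IndexOf scan.

// Hypothetical helper illustrating the hash-join idea.
// Assumes using System.Collections.Generic;
static bool SharesAtLeastSixIds(HashSet<int> firstIds, List<int> secondIds)
{
    int matches = 0;
    foreach (int id in secondIds)
    {
        // O(1) membership test instead of List<int>.IndexOf
        if (firstIds.Contains(id) && ++matches >= 6)
        {
            return true; // early exit once the threshold is reached
        }
    }
    return false;
}

In the original loop you would build var mainSet = new HashSet<int>(main_keyword.ids); once per outer keyword and call SharesAtLeastSixIds(mainSet, keyword_to_compare.ids) in the inner loop.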

The second source of inefficiency is repeating everything twice. Since (a join b) is equal to (b join a), you do not need two full passes over the whole List<Keyword>. Actually, you only need to compare each keyword against the non-duplicate ones kept so far.

Using some code from here, you can write the algorithm like this:

// Requires System.Threading.Tasks (for Parallel) and System.Linq;
// IntersectSorted is the sorted-intersection extension method from the linked answer.
Parallel.ForEach(list, k => k.ids.Sort());

List<Keyword> result = new List<Keyword>();

foreach (var k in list)
{
    if (result.Any(r => r.ids.IntersectSorted(k.ids, Comparer<int>.Default)
                             .Skip(5)
                             .Any()))
    {
        k.hidden = 1;
        k.live = 0;
        k.worked = true;
    }
    else
    {
        result.Add(k);
    }
}

If you replace the LINQ with just the index-manipulation approach (see the link above), it would be a tiny bit faster, I guess.
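As a rough sketch of that index-manipulation variant (assuming both id lists are already sorted, as in the code above; the helper name is mine, not from the linked answer):

// Walks two sorted id lists with two indices (merge-join style) and
// returns as soon as 6 common ids have been seen.
static bool HasSixCommonSorted(List<int> a, List<int> b)
{
    int i = 0, j = 0, common = 0;
    while (i < a.Count && j < b.Count)
    {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else
        {
            if (++common >= 6) return true;
            i++;
            j++;
        }
    }
    return false;
}

The LINQ condition would then become result.Any(r => HasSixCommonSorted(r.ids, k.ids)).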
