
C# comparing list of IDs

I have a List<Keyword> where Keyword class is:

public string keyword;
public List<int> ids;
public int hidden;
public int live;
public bool worked;

Each Keyword has its own keyword string and a set of 20 ids; live defaults to 1 and hidden to 0.

I just need to iterate over the whole main list to invalidate those keywords that share too many ids with another one: comparing every pair, if the second keyword has 6 or more ids in common with the first, its hidden is set to 1 and its live to 0.

The algorithm is very basic but it takes too long when the main list has many elements.

I'm trying to guess if there could be any method I could use to increase the speed.

The basic algorithm I use is:

foreach (Keyword main_keyword in lista_de_keywords_live)
{
    if (main_keyword.worked) {
        continue;
    }
    foreach (Keyword keyword_to_compare in lista_de_keywords_live)
    {
        if (keyword_to_compare.worked || ReferenceEquals(keyword_to_compare, main_keyword)) continue;

        n_ids_same = 0;
        foreach (int id in main_keyword.ids)
        {
            if (keyword_to_compare.ids.IndexOf(id) >= 0)
            {
                if (++n_ids_same >= 6) break;
            }
        }

        if (n_ids_same >= 6)
        {
            keyword_to_compare.hidden = 1;
            keyword_to_compare.live   = 0;
            keyword_to_compare.worked = true;
        }
    }
}

The code below is an example of how you could use a HashSet for your problem. However, I would not recommend it in this scenario. On the other hand, the idea of sorting the ids to make the comparison faster still applies. Run it in a Console Project to try it out.

Notice that once I'm done adding new ids to a keyword, I sort them. This makes the comparison faster later on.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace KeywordExample
{

    public class Keyword
    {
        public List<int> ids;
        public int hidden;
        public int live;
        public bool worked;

        public Keyword()
        {
            ids = new List<int>();
            hidden = 0;
            live = 1;
            worked = false;
        }

        public override string ToString()
        {
            StringBuilder s = new StringBuilder();
            if (ids.Count > 0)
            {
                s.Append(ids[0]);
                for (int i = 1; i < ids.Count; i++)
                {
                    s.Append(',' + ids[i].ToString());
                }
            }
            return s.ToString();
        }

    }

    public class KeywordComparer : EqualityComparer<Keyword>
    {
        public override bool Equals(Keyword k1, Keyword k2)
        {
            int equals = 0;
            int i = 0;
            int j = 0;

            //based on sorted ids
            while (i < k1.ids.Count && j < k2.ids.Count)
            {
                if (k1.ids[i] < k2.ids[j])
                {
                    i++;
                }
                else if (k1.ids[i] > k2.ids[j])
                {
                    j++;
                }
                else
                {
                    equals++;
                    i++;
                    j++;
                }
            }

            return equals >= 6;
        }
        public override int GetHashCode(Keyword keyword)
        {
            return 0;//notice that using the same hash for all keywords gives you an O(n^2) time complexity though.
        }
    }


    class Program
    {

        static void Main(string[] args)
        {
            List<Keyword> listOfKeywordsLive = new List<Keyword>();
            //add some values
            Random random = new Random();
            int n = 10;
            int sizeOfMaxId = 20;
            for (int i = 0; i < n; i++)
            {
                var newKeyword = new Keyword();
                for (int j = 0; j < 20; j++)
                {
                    newKeyword.ids.Add(random.Next(sizeOfMaxId) + 1);
                }
                newKeyword.ids.Sort(); //sorting the ids
                listOfKeywordsLive.Add(newKeyword);
            }

            //solution here
            HashSet<Keyword> set = new HashSet<Keyword>(new KeywordComparer());
            set.Add(listOfKeywordsLive[0]);
            for (int i = 1; i < listOfKeywordsLive.Count; i++)
            {
                Keyword keywordToCompare = listOfKeywordsLive[i];
                if (!set.Add(keywordToCompare))
                {
                    keywordToCompare.hidden = 1;
                    keywordToCompare.live = 0;
                    keywordToCompare.worked = true;
                }
            }

            //print all keywords to check
            Console.WriteLine(set.Count + "/" + n + " inserted");
            foreach (var keyword in set)
            {
                Console.WriteLine(keyword);
            }

        }

    }
}

The obvious source of inefficiency is the way you calculate the intersection of two lists (of ids). The algorithm is O(n^2). This is, by the way, the problem that relational databases solve for every join, and your approach would be called a loop join. The main efficient strategies are hash join and merge join. For your scenario the latter may be the better fit, I guess, but you can also try HashSets if you like.
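As a sketch of the hash-join idea applied to two id lists (`JoinSketch.CountCommon` is a hypothetical helper, not from the original post):

```csharp
using System;
using System.Collections.Generic;

static class JoinSketch
{
    // Hash-join style overlap count: build a set from one id list,
    // then probe it with the other. Roughly O(n + m) per pair,
    // instead of the O(n * m) of repeated IndexOf calls.
    // Note: the HashSet collapses duplicates in the first list, so
    // this counts distinct common values; that is fine as long as
    // each keyword's ids are unique.
    public static int CountCommon(List<int> a, List<int> b)
    {
        var set = new HashSet<int>(a);
        int common = 0;
        foreach (int id in b)
        {
            if (set.Contains(id))
                common++;
        }
        return common;
    }
}
```

For example, `JoinSketch.CountCommon(new List<int> { 1, 2, 3 }, new List<int> { 2, 3, 4 })` returns 2.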

The second source of inefficiency is doing everything twice. Since (a join b) is equal to (b join a), you do not need two full passes over the whole List<Keyword>. In fact, you only need to compare each keyword against the non-duplicate ones found so far.

Using some code from here, you can write the algorithm like:

Parallel.ForEach(list, k => k.ids.Sort());

List<Keyword> result = new List<Keyword>();

foreach (var k in list)
{
    if (result.Any(r => r.ids.IntersectSorted(k.ids, Comparer<int>.Default)
                             .Skip(5)
                             .Any()))
    {
        k.hidden = 1;
        k.live = 0;
        k.worked = true;
    }
    else
    {
        result.Add(k);
    }
}

If you replace the LINQ with the plain index-manipulation approach (see the link above), it should be a tiny bit faster, I guess.
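For instance, the LINQ chain could be replaced by a hand-rolled merge over the two sorted lists with an early exit (a sketch; `HasAtLeastNCommon` is a hypothetical helper, not from the linked code):

```csharp
using System.Collections.Generic;

static class MergeSketch
{
    // Walks two sorted lists in lockstep (merge-join style) and
    // returns true as soon as n common elements have been found,
    // so it can stop well before the ends of the lists.
    public static bool HasAtLeastNCommon(List<int> a, List<int> b, int n)
    {
        int i = 0, j = 0, common = 0;
        while (i < a.Count && j < b.Count)
        {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else
            {
                if (++common >= n) return true;
                i++;
                j++;
            }
        }
        return false;
    }
}
```

The filter then becomes `result.Any(r => MergeSketch.HasAtLeastNCommon(r.ids, k.ids, 6))`, avoiding the enumerator allocations of `IntersectSorted(...).Skip(5).Any()`.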
