Improve performance of getting distinct items

Question

How can I convert below code to LINQ. Lists can sometimes contains 20k or 30k items. So I'm looking for something that improves performance and runs faster. Below is my code:

     if(list1 != null)
     {
       foreach (var item in list1)
       {
        if(!list2.Any( x => x.Name == item.Name && x.Number == item.Number))
        {
          list2.Add(item)
        }
       }
     }

I tried using Parallel.ForEach but it throws a "Collection was modified" error.

Answer 1

You can use the LINQ Distinct method. It needs an IEqualityComparer set up, but luckily the MSDN example has just what you need already written:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static Random rand = new Random();

        // see https://docs.microsoft.com/en-us/dotnet/api/system.linq.enumerable.distinct for Distinct()
        public class Product
        {
            public string Name { get; set; }
            public int Number { get; set; }
        }

        // Custom comparer for the Product class
        class ProductComparer : IEqualityComparer<Product>
        {
            // Products are equal if their names and product numbers are equal.
            public bool Equals(Product x, Product y)
            {

                //Check whether the compared objects reference the same data.
                if (Object.ReferenceEquals(x, y)) return true;

                //Check whether any of the compared objects is null.
                if (Object.ReferenceEquals(x, null) || Object.ReferenceEquals(y, null))
                    return false;

                //Check whether the products' properties are equal.
                return x.Number == y.Number && x.Name == y.Name;
            }

            // If Equals() returns true for a pair of objects 
            // then GetHashCode() must return the same value for these objects.

            public int GetHashCode(Product product)
            {
                //Check whether the object is null
                if (Object.ReferenceEquals(product, null)) return 0;

                //Get hash code for the Name field if it is not null.
                int hashProductName = product.Name == null ? 0 : product.Name.GetHashCode();

                //Get hash code for the Code field.
                int hashProductCode = product.Number.GetHashCode();

                //Calculate the hash code for the product.
                return hashProductName ^ hashProductCode;
            }

        }

        static string RandomLetter()
        {
            return (rand.Next((int)'A', (int)'Z' + 1)).ToString();
        }

        static List<Product> CreateTestData()
        {
            int nItems = 20000;
            List<Product> data = new List<Product>(nItems);
            for (int i = 1; i <= nItems; i++)
            {
                data.Add(new Product { Name = RandomLetter() + RandomLetter(), Number = i % 10 });
            }

            return data;
        }

        static void Main(string[] args)
        {
            var list1 = CreateTestData();
            Stopwatch sw = new Stopwatch();
            sw.Start();
            List<Product> noduplicates = list1.Distinct(new ProductComparer()).ToList();
            sw.Stop();
            Console.WriteLine($"x items: {list1.Count()} no duplicates: {noduplicates.Count()} Time: {sw.ElapsedMilliseconds} ms");

            List<Product> list2 = new List<Product>();
            if (list1 != null)
            {
                sw.Restart();
                foreach (var item in list1)
                {
                    if (!list2.Any(x => x.Name == item.Name && x.Number == item.Number))
                    {
                        list2.Add(item);
                    }
                }
                sw.Stop();
                Console.WriteLine($"x items: {list1.Count()} list2: {noduplicates.Count()} Time: {sw.ElapsedMilliseconds} ms");
            }

            Console.ReadLine();

        }
    }
}

Sample output:

x items: 20000 no duplicates: 6393 Time: 12 ms
x items: 20000 list2: 6393 Time: 4225 ms

If you already had some data, you could use the Union method instead, again using the comparer.

NB My RandomLetter() function does not do what I intended. But it suffices.

Answer 2

20 - 30k items are not so much. All you need is to replace the potentially slow linear search

list2.Any(x => x.Name == item.Name && x.Number == item.Number)

with fast lookup data structure.

The easiest is to build a HashSet with anonymous type containing the Name and Number properties. In order to do that, you can use the following handy custom extension method:

public static class Extensions
{
    public static HashSet<T> ToHashSet<T>(this IEnumerable<T> source, IEqualityComparer<T> comparer = null)
    {
        return new HashSet<T>(source, comparer);
    }
}

and the code in question would be like this:

if (list1 != null)
{
    var keys = list2.Select(item => new { item.Name, item.Number }).ToHashSet();
    foreach (var item in list1)
    {
        var key = new { item.Name, item.Number };
        if (!keys.Contains(key))
        {
            list2.Add(item);
            keys.Add(key);
        }
    }
}

This is not LINQ, but it doesn't need to be, since LINQ is for querying, while your code is for modification.

Answer 3

You can make list2 a ConcurrentBag type and do like this. im not 100% sure that it will work as intended though.

public class Item
{
    public string Name { get; set; }
    public int Number { get; set; }
}
public void test()
{
    var list1 = new List<Item>(); // add items to list1 or maybe load from a database?

    var list2 = new ConcurrentBag<Item>();

    Parallel.ForEach(list1.ToList(), (item, state, arg3) =>
    {
        if (!list2.Any(x => x.Name == item.Name && x.Number == item.Number))
        {
            list2.Add(item);
        }

    });

}

Answer 4

Group by your distiction and select the first record from the group and create your list.

Fastest if you don't need to provide existing values

var list2 = list1.GroupBy(i => new { i.Name, i.Number })
                 .Select(g=>g.First())
                 .ToList();

Simple if you have existing values (4 times slower than the next version)

if you list2 has preexisting values you can do something like this...

var keys = list2.ToList();
var toadd = list1.GroupBy(i => new { i.Name, i.Number })
                 .Where(g => !keys.Any(i => i.Name == g.Key.Name
                                         && i.Number == g.Key.Number))
                 .Select(g=>g.First());
list2.AddRange(toadd);

Fastest if you need to update a set with existing values

public static HashSet<T> ToHashSet<T>(this IEnumerable<T> items)
{
    return new HashSet<T>(items);
}

var keys = list2.Select(i => new { i.Name, i.Number }).ToHashSet();
var toadd = list1.GroupBy(i => new { i.Name, i.Number })
                 .Where(g => !keys.Contains(g.Key))
                 .Select(g => g.First());
list2.AddRange(toadd);

Improve performance of getting distinct items

Question

4 answers

solution1
3 2016-09-27 18:41:10

solution2
1 2016-09-27 18:43:08

solution3
0 2016-09-27 17:51:09

solution4
0 2016-09-27 19:29:41

Improve performance of getting distinct items

Question

4 answers

solution1 3 2016-09-27 18:41:10

solution2 1 2016-09-27 18:43:08

solution3 0 2016-09-27 17:51:09

solution4 0 2016-09-27 19:29:41

solution1
3 2016-09-27 18:41:10

solution2
1 2016-09-27 18:43:08

solution3
0 2016-09-27 17:51:09

solution4
0 2016-09-27 19:29:41