
With two very large lists/collections - how to detect and/or remove duplicates efficiently

Why

When comparing and deduplicating across two lists, coders under time pressure often don't find the most runtime-efficient implementation. Two nested for-loops are a common go-to solution for many coders. One might try a CROSS JOIN with LINQ, but this is clearly inefficient. Coders need a memorable, code-efficient approach for this that is also relatively runtime-efficient.

This question was created after seeing a more specific one: Delete duplicates in a single dataset relative to another one in C# - that question is specialised to DataSets, and the term "dataset" would not help people searching in the future. No other generalised question was found.

What

I have used the term List/Collection to help with this more general coding problem.

var setToDeduplicate = new List<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M 

var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M

var deduplicatedSet = deduplicationFunction(setToDeduplicate, referenceSet);

Implementing the deduplicationFunction function should make the input data and expected output clear. The output can be IEnumerable. The expected output for this example input would be the even numbers from 1-1M: {2,4,6,8,...}.

Note: There may be duplicates within the referenceSet. The values in both sets are indicative only, so I'm not looking for a mathematical solution - this should also work for random number inputs in both sets.

If this is approached with simple LINQ functions it will be too slow: O(1M × 0.5M). There needs to be a faster approach for such large sets.
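For contrast, this is roughly the naive LINQ shape being warned against (a hypothetical sketch): each List<int>.Contains call scans the whole reference list, which is what makes it O(n × m).

var deduplicatedSet = setToDeduplicate
    .Where(x => !referenceSet.Contains(x)) //Linear scan of referenceSet for every item
    .Distinct(); //Also drop duplicates within setToDeduplicate itself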

Speed is important, but incremental improvements bought with a large bloat of code are of less value. Ideally it would also work for other datatypes, including data-model objects, but answering this specific question should be enough; other datatypes would simply involve some extra pre-processing or a slight change to the answer.

Solution Summary

Here's the test code, for results which follow:

using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Preparing...");

            List<int> set1 = new List<int>();
            List<int> set2 = new List<int>();

            Random r = new Random();
            var max = 10000;

            for (int i = 0; i < max; i++)
            {
                set1.Add(r.Next(0, max));
                set2.Add(r.Next(0, max/2) * 2);
            }

            Console.WriteLine("First run...");

            Stopwatch sw = new Stopwatch();
            IEnumerable<int> result;
            int count;

            while (true)
            {
                sw.Start();
                result = deduplicationFunction(set1, set2);
                var results1 = result.ToList();
                count = results1.Count;
                sw.Stop();
                Console.WriteLine("Dictionary and Where - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                sw.Start();
                result = deduplicationFunction2(set1, set2);
                var results2 = result.ToList();
                count = results2.Count;
                sw.Stop();
                Console.WriteLine("  HashSet ExceptWith - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction3(set1, set2);
                var results3 = result.ToList();
                count = results3.Count;
                sw.Stop();
                Console.WriteLine("     Sort Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction4(set1, set2);
                var results4 = result.ToList();
                count = results4.Count;
                sw.Stop();
                Console.WriteLine("Presorted Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                set2.RemoveAt(set2.Count - 1); //Remove the last item, because it was added in the 3rd test

                sw.Start();
                result = deduplicationFunction5(set1, set2);
                var results5 = result.ToList();
                count = results5.Count;
                sw.Stop();
                Console.WriteLine("        Nested Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                Console.ReadLine();

                Console.WriteLine("");
                Console.WriteLine("Next Run");
                Console.WriteLine("");
            }

        }


        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var ReferenceHashSet = Reference
                                .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                                .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer

            int throwAway;
            return Set.Distinct().Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
        }

        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction2(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var SetAsHash = new HashSet<int>();

            Set.ForEach(x =>
            {
                if (SetAsHash.Contains(x))
                    return;

                SetAsHash.Add(x);
            }); // .Net 4.7.2 - ToHashSet will reduce this code to a single line.
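            // With .NET 4.7.2+ the above reduces to: var SetAsHash = Set.ToHashSet();
            // (HashSet<int>.Add already ignores duplicates, so the Contains guard is optional.)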

            SetAsHash.ExceptWith(Reference); // This is ultimately what we're testing

            return SetAsHash.AsEnumerable();
        }

        static IEnumerable<int> deduplicationFunction3(List<int> Set, List<int> Reference)
        {
            Set.Sort();
            Reference.Sort();
            Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.

            return deduplicationFunction4(Set, Reference);
        }

        static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
        {
            int i1 = 0;
            int i2 = 0;
            int thisValue = Set[i1];
            int thisReference = Reference[i2];
            while (true)
            {
                var difference = thisReference - thisValue;

                if (difference < 0)
                {
                    i2++; //Compare side is too low, there might be an equal value to be found
                    if (i2 == Reference.Count)
                        break;
                    thisReference = Reference[i2];
                    continue;
                }

                if (difference > 0) //No match in Reference: not a duplicate, so keep it
                    yield return thisValue;

                GoFurther:
                i1++;
                if (i1 == Set.Count)
                    break;
                if (Set[i1] == thisValue) //Eliminates duplicates
                    goto GoFurther; //I rarely use goto statements, but this is a good situation

                thisValue = Set[i1];
            }
        }

        static IEnumerable<int> deduplicationFunction5(List<int> Set, List<int> Reference)
        {
            var found = false;
            var lastValue = 0;
            var thisValue = 0;
            for (int i = 0; i < Set.Count; i++)
            {
                thisValue = Set[i];

                if (thisValue == lastValue)
                    continue;

                lastValue = thisValue;

                found = false;
                for (int x = 0; x < Reference.Count; x++)
                {
                    if (thisValue != Reference[x])
                        continue;

                    found = true;
                    break;
                }

                if (found)
                    continue;

                yield return thisValue;
            }
        }
    }
}

I'll use this to compare the performance of multiple approaches. (I'm particularly interested in the hash approach vs the dual-index-on-sorted approach at this stage, although ExceptWith enables a terse solution.)

Results so far, on 10k items per set:

First Run

  • Dictionary and Where - Count: 3565, Milliseconds: 16.38.
  • HashSet ExceptWith - Count: 3565, Milliseconds: 5.33.
  • Sort Dual Index - Count: 3565, Milliseconds: 6.34.
  • Presorted Dual Index - Count: 3565, Milliseconds: 1.14.
  • Nested Index - Count: 3565, Milliseconds: 964.16.

Good Run

  • Dictionary and Where - Count: 3565, Milliseconds: 1.21.
  • HashSet ExceptWith - Count: 3565, Milliseconds: 0.94.
  • Sort Dual Index - Count: 3565, Milliseconds: 1.09.
  • Presorted Dual Index - Count: 3565, Milliseconds: 0.76.
  • Nested Index - Count: 3565, Milliseconds: 628.60.

Chosen answer:

  • @backs' HashSet.ExceptWith approach - marginally faster with minimal code and uses an interesting function, ExceptWith; however, it is weakened by its lack of versatility and by the fact that the function is less commonly known.
  • One of my answers: HashSet > Where(..Contains..) - only a tiny bit slower than @backs', but uses a LINQ-based code pattern that is very versatile beyond lists of primitive elements. I believe this is the more common scenario I find myself in when coding, and trust this is the case for many other coders.
  • Special thanks to @TheGeneral for benchmarking some of the answers, for some interesting unsafe versions, and for helping to make @Backs' answer more efficient for a follow-up test.

Use a HashSet for your initial list and the ExceptWith method to get the result set:

var setToDeduplicate = new HashSet<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M 

var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M

setToDeduplicate.ExceptWith(referenceSet);
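Note that ExceptWith mutates the HashSet in place rather than returning a new sequence, so take a copy first if the original contents are still needed. A minimal sketch:

var original = new HashSet<int>(setToDeduplicate); //Preserve the pre-dedup contents
setToDeduplicate.ExceptWith(referenceSet); //setToDeduplicate now holds only the non-duplicates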

Here are some more; basically I wanted to test both distinct and non-distinct input against a variety of solutions. In the non-distinct version I had to call Distinct where needed on the final output.

Mode             : Release (64Bit)
Test Framework   : .NET Framework 4.7.1

Operating System : Microsoft Windows 10 Pro
Version          : 10.0.17134

CPU Name         : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Description      : Intel64 Family 6 Model 58 Stepping 9

Cores (Threads)  : 4 (8)      : Architecture  : x64
Clock Speed      : 3901 MHz   : Bus Speed     : 100 MHz
L2Cache          : 1 MB       : L3Cache       : 8 MB

Benchmarks Runs : Inputs (1) * Scales (5) * Benchmarks (6) * Runs (100) = 3,000

Results: Distinct input

--- Random Set 1 ---------------------------------------------------------------------
| Value         |   Average |   Fastest |      Cycles |    Garbage | Test |     Gain |
--- Scale 100 --------------------------------------------------------- Time 0.334 ---
| Backs         |  0.008 ms |  0.007 ms |      31,362 |   8.000 KB | Pass |  68.34 % |
| ListUnsafe    |  0.009 ms |  0.008 ms |      35,487 |   8.000 KB | Pass |  63.45 % |
| HasSet        |  0.012 ms |  0.011 ms |      46,840 |   8.000 KB | Pass |  50.03 % |
| ArrayUnsafe   |  0.013 ms |  0.011 ms |      49,388 |   8.000 KB | Pass |  47.75 % |
| HashSetUnsafe |  0.018 ms |  0.013 ms |      66,866 |  16.000 KB | Pass |  26.62 % |
| Todd          |  0.024 ms |  0.019 ms |      90,763 |  16.000 KB | Base |   0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.377 ---
| Backs         |  0.070 ms |  0.060 ms |     249,374 |  28.977 KB | Pass |  57.56 % |
| ListUnsafe    |  0.078 ms |  0.067 ms |     277,080 |  28.977 KB | Pass |  52.67 % |
| HasSet        |  0.093 ms |  0.083 ms |     329,686 |  28.977 KB | Pass |  43.61 % |
| ArrayUnsafe   |  0.096 ms |  0.082 ms |     340,154 |  36.977 KB | Pass |  41.72 % |
| HashSetUnsafe |  0.103 ms |  0.085 ms |     367,681 |  55.797 KB | Pass |  37.07 % |
| Todd          |  0.164 ms |  0.151 ms |     578,933 | 112.664 KB | Base |   0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.965 ---
| ListUnsafe    |  0.706 ms |  0.611 ms |   2,467,327 | 258.516 KB | Pass |  48.60 % |
| Backs         |  0.758 ms |  0.654 ms |   2,656,610 | 180.297 KB | Pass |  44.81 % |
| ArrayUnsafe   |  0.783 ms |  0.696 ms |   2,739,156 | 276.281 KB | Pass |  43.02 % |
| HasSet        |  0.859 ms |  0.752 ms |   2,999,230 | 198.063 KB | Pass |  37.47 % |
| HashSetUnsafe |  0.864 ms |  0.783 ms |   3,029,086 | 332.273 KB | Pass |  37.07 % |
| Todd          |  1.373 ms |  1.251 ms |   4,795,929 | 604.742 KB | Base |   0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 5.535 ---
| ListUnsafe    |  5.624 ms |  4.874 ms |  19,658,154 |   2.926 MB | Pass |  40.36 % |
| HasSet        |  7.574 ms |  6.548 ms |  26,446,193 |   2.820 MB | Pass |  19.68 % |
| Backs         |  7.585 ms |  5.634 ms |  26,303,794 |   2.009 MB | Pass |  19.57 % |
| ArrayUnsafe   |  8.287 ms |  6.219 ms |  28,923,797 |   3.583 MB | Pass |  12.12 % |
| Todd          |  9.430 ms |  7.326 ms |  32,880,985 |   2.144 MB | Base |   0.00 % |
| HashSetUnsafe |  9.601 ms |  7.859 ms |  32,845,228 |   5.197 MB | Pass |  -1.81 % |
--- Scale 1,000,000 -------------------------------------------------- Time 47.652 ---
| ListUnsafe    | 57.751 ms | 44.734 ms | 201,477,028 |  29.309 MB | Pass |  22.14 % |
| Backs         | 65.567 ms | 49.023 ms | 228,772,283 |  21.526 MB | Pass |  11.61 % |
| HasSet        | 73.163 ms | 56.799 ms | 254,703,994 |  25.904 MB | Pass |   1.36 % |
| Todd          | 74.175 ms | 53.739 ms | 258,760,390 |   9.144 MB | Base |   0.00 % |
| ArrayUnsafe   | 86.530 ms | 67.803 ms | 300,374,535 |  13.755 MB | Pass | -16.66 % |
| HashSetUnsafe | 97.140 ms | 77.844 ms | 337,639,426 |  39.527 MB | Pass | -30.96 % |
--------------------------------------------------------------------------------------

Results: Random list, using Distinct on results where needed

--- Random Set 1 ---------------------------------------------------------------------
| Value         |    Average |   Fastest |      Cycles |    Garbage | Test |    Gain |
--- Scale 100 --------------------------------------------------------- Time 0.272 ---
| Backs         |   0.007 ms |  0.006 ms |      28,449 |   8.000 KB | Pass | 72.96 % |
| HasSet        |   0.010 ms |  0.009 ms |      38,222 |   8.000 KB | Pass | 62.05 % |
| HashSetUnsafe |   0.014 ms |  0.010 ms |      51,816 |  16.000 KB | Pass | 47.52 % |
| ListUnsafe    |   0.017 ms |  0.014 ms |      64,333 |  16.000 KB | Pass | 33.84 % |
| ArrayUnsafe   |   0.020 ms |  0.015 ms |      72,468 |  16.000 KB | Pass | 24.70 % |
| Todd          |   0.026 ms |  0.021 ms |      95,500 |  24.000 KB | Base |  0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.361 ---
| Backs         |   0.061 ms |  0.053 ms |     219,141 |  28.977 KB | Pass | 70.46 % |
| HasSet        |   0.092 ms |  0.080 ms |     325,353 |  28.977 KB | Pass | 55.78 % |
| HashSetUnsafe |   0.093 ms |  0.079 ms |     331,390 |  55.797 KB | Pass | 55.03 % |
| ListUnsafe    |   0.122 ms |  0.101 ms |     432,029 |  73.016 KB | Pass | 41.19 % |
| ArrayUnsafe   |   0.133 ms |  0.113 ms |     469,560 |  73.016 KB | Pass | 35.88 % |
| Todd          |   0.208 ms |  0.173 ms |     730,661 | 148.703 KB | Base |  0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.870 ---
| Backs         |   0.620 ms |  0.579 ms |   2,174,415 | 180.188 KB | Pass | 55.31 % |
| HasSet        |   0.696 ms |  0.635 ms |   2,440,300 | 198.063 KB | Pass | 49.87 % |
| HashSetUnsafe |   0.731 ms |  0.679 ms |   2,563,125 | 332.164 KB | Pass | 47.32 % |
| ListUnsafe    |   0.804 ms |  0.761 ms |   2,818,293 | 400.492 KB | Pass | 42.11 % |
| ArrayUnsafe   |   0.810 ms |  0.751 ms |   2,838,680 | 400.492 KB | Pass | 41.68 % |
| Todd          |   1.388 ms |  1.271 ms |   4,863,651 | 736.953 KB | Base |  0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 6.616 ---
| Backs         |   5.604 ms |  4.710 ms |  19,600,934 |   2.009 MB | Pass | 62.92 % |
| HasSet        |   6.607 ms |  5.847 ms |  23,093,963 |   2.820 MB | Pass | 56.29 % |
| HashSetUnsafe |   8.565 ms |  7.465 ms |  29,239,067 |   5.197 MB | Pass | 43.34 % |
| ListUnsafe    |  11.447 ms |  9.543 ms |  39,452,865 |   5.101 MB | Pass | 24.28 % |
| ArrayUnsafe   |  11.517 ms |  9.841 ms |  39,731,502 |   5.483 MB | Pass | 23.81 % |
| Todd          |  15.116 ms | 11.369 ms |  51,963,309 |   3.427 MB | Base |  0.00 % |
--- Scale 1,000,000 -------------------------------------------------- Time 55.310 ---
| Backs         |  53.766 ms | 44.321 ms | 187,905,335 |  21.526 MB | Pass | 51.32 % |
| HasSet        |  60.759 ms | 50.742 ms | 212,409,649 |  25.904 MB | Pass | 44.99 % |
| HashSetUnsafe |  79.248 ms | 67.130 ms | 275,455,545 |  39.527 MB | Pass | 28.25 % |
| ListUnsafe    | 106.527 ms | 90.159 ms | 370,838,650 |  39.153 MB | Pass |  3.55 % |
| Todd          | 110.444 ms | 93.225 ms | 384,636,081 |  22.676 MB | Base |  0.00 % |
| ArrayUnsafe   | 114.548 ms | 98.033 ms | 398,219,513 |  38.974 MB | Pass | -3.72 % |
--------------------------------------------------------------------------------------

Data

private Tuple<List<int>, List<int>> GenerateData(int scale)
{
   return new Tuple<List<int>, List<int>>(
      Enumerable.Range(0, scale)
                .Select(x => x)
                .ToList(),
      Enumerable.Range(0, scale)
                .Select(x => Rand.Next(10000))
                .ToList());
}

Code

public class Backs : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var hashSet = new HashSet<int>(Input.Item1); 
      hashSet.ExceptWith(Input.Item2); 
      return hashSet.ToList(); 
   }
}

public class HasSet : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{

   protected override List<int> InternalRun()
   {
      var hashSet = new HashSet<int>(Input.Item2); 

      return Input.Item1.Where(y => !hashSet.Contains(y)).ToList(); 
   }
}

public class Todd : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var referenceHashSet = Input.Item2.Distinct()                 
                                      .ToDictionary(x => x, x => x);

      return Input.Item1.Where(y => !referenceHashSet.TryGetValue(y, out _)).ToList();
   }
}

public unsafe class HashSetUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new HashSet<int>();
      fixed (int* pAry = Input.Item1.ToArray())
      {
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               result.Add(*p);
         }
      }
      return result.ToList(); 
   }
}
public unsafe class ListUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new List<int>(Input.Item1.Count); //Pre-size to the source list, whose survivors we collect

      fixed (int* pAry = Input.Item1.ToArray())
      {
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               result.Add(*p);
         }
      }
      return result.ToList(); 
   }
}

public unsafe class ArrayUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new int[Input.Item1.Count];

      fixed (int* pAry = Input.Item1.ToArray(), pRes = result)
      {
         var j = 0;
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               *(pRes+j++) = *p;
         }
         return result.Take(j).ToList(); 
      }

   }
}

Summary

No surprises here, really: if you have a distinct list to start with, some solutions do better; if not, the simplest HashSet version is the best.

Single-loop Dual-Index

As recommended by @PepitoSh in the question comments:

I think HashSet is a very generic solution to a rather specific problem. If your lists are ordered, scanning them in parallel and comparing the current items is the fastest.

This is very different from having two nested loops. Instead there is a single general loop, and the two indexes are incremented in ascending order in parallel, depending on the relative difference of the current values. That difference is essentially the output of any normal Comparison function: { negative, 0, positive }.

static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
    int i1 = 0;
    int i2 = 0;
    int thisValue = Set[i1];
    int thisReference = Reference[i2];
    while (true)
    {
        var difference = thisReference - thisValue;

        if (difference < 0)
        {
            i2++; //Compare side is too low, there might be an equal value to be found
            if (i2 == Reference.Count)
                break;
            thisReference = Reference[i2];
            continue;
        }

        if (difference > 0) //No match in Reference: not a duplicate, so keep it
            yield return thisValue;

        GoFurther:
        i1++;
        if (i1 == Set.Count)
            break;
        if (Set[i1] == thisValue) //Eliminates duplicates
            goto GoFurther; //I rarely use goto statements, but this is a good situation

        thisValue = Set[i1];
    }
}

How to call this function, if the lists aren't yet sorted:

Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.

return deduplicationFunction4(Set, Reference);

This gave me the best performance in my benchmarking. It could probably also be tried with unsafe code for more of a speedup in some scenarios. Where the data is already sorted, this is by far the best approach. A faster sorting algorithm might also be selected, but that is not the subject of this question.

Note: This method deduplicates as it goes.

I have actually coded such a single-loop pattern before, when finalising text-search results, except I had N arrays to check for "closeness", so I had an array of indexes - array[index[i]]. I'm sure a single loop with controlled index incrementing isn't a new concept, but it's certainly a great solution here.

HashSet and Where

You must use a HashSet (or Dictionary) for speed:

//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = Reference
                        .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                        .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer

    int throwAway;
    return Set.Distinct().Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
}

That's a lambda-expression version. It uses Dictionary, which provides adaptability for varying the value if needed. Literal for-loops could be used, and perhaps some more incremental performance gained, but relative to two nested loops this is already an amazing improvement.

Learning a few things while looking at other answers, here is a faster implementation:

static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference);
    return Set.Where(y => ReferenceHashSet.Contains(y) == false).Distinct();
}
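For illustration, a minimal call site with hypothetical sample data:

var set = new List<int>() { 1, 2, 2, 3, 4 };
var reference = new List<int>() { 2, 4 };
var result = deduplicationFunction(set, reference).ToList(); //Yields { 1, 3 }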

Importantly, this approach (while a tiny bit slower than @backs' answer) is still versatile enough to use for database entities, AND other types can easily be used on the duplicate-check field.

Here's an example of how the code is easily adjusted for use with a Person kind of database-entity list:

static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
    return Set.Where(y => ReferenceHashSet.Contains(y.ID) == false)
            .GroupBy(p => p.ID).Select(p => p.First()); //The groupby and select should accomplish DistinctBy(..p.ID)
}
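For reference, a minimal shape for the Person entity assumed above (hypothetical - only the ID key matters for the duplicate check):

class Person
{
    public int ID { get; set; } //The duplicate-check field
    public string Name { get; set; }
}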
