Finding the difference between two lists of strings

I'm pretty sure this is a duplicate, but I have tried everything and still cannot get the differences. I have two lists of strings, listA and listB, and I'm trying to find the items in listA that are not in listB.

Example:

    listA: "1", "2", "4", "7"
    listB: "2", "4"

The output I want is: "1", "7"

Here are a for loop and a lambda expression that I tried, but they take a really long time:

//these two approaches take too long for huge lists

    foreach (var item in listA)
    {
        if (!listB.Contains(item))
            diff.Add(item);
    }

    diff = listA.Where(id => !listB.Contains(id)).ToList();

//these don't give me the right differences

    listA.Except(listB).ToList();

    var set = new HashSet<string>(listA);
    set.SymmetricExceptWith(listB);

Use LINQ's Except method:

    listA.Except(listB).ToList();

should give the correct answer, but

    set.SymmetricExceptWith(listB);

should not. SymmetricExceptWith gives the items in listA that are not in listB plus the items in listB that are not in listA.
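To make the distinction concrete, here is a minimal sketch. The class and method names (DifferenceDemo, OnlyInA, InExactlyOne) are made up for illustration; only Except and SymmetricExceptWith come from the question.

```csharp
using System.Collections.Generic;
using System.Linq;

static class DifferenceDemo
{
    // One-sided difference: items of a that do not occur in b.
    public static List<string> OnlyInA(IEnumerable<string> a, IEnumerable<string> b)
        => a.Except(b).ToList();

    // Symmetric difference: items that occur in exactly one of a and b.
    public static HashSet<string> InExactlyOne(IEnumerable<string> a, IEnumerable<string> b)
    {
        var set = new HashSet<string>(a);
        set.SymmetricExceptWith(b);   // modifies the set in place
        return set;
    }
}
```

With listA = "1", "2", "4", "7" and listB = "2", "4", "9", OnlyInA returns "1", "7", while InExactlyOne also contains "9", because "9" is in listB but not in listA. When every item of listB is also in listA, the two results hold the same items.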

All the code you posted should work, so the error is somewhere else. In any case, since you write "these take a really long time", I suppose you have a performance issue.

Let's do a very quick and dirty comparison (a good performance test is a long process; self-promotion: the benchmark was done with this free tool). Assumptions:

  • Lists are unordered.
  • There may be duplicates in our inputs, but we don't want duplicates in the result.
  • The second list is always a subset of the first list (assumed because you're using SymmetricExceptWith; if it is not, its result is quite different from that of Except). If that was a mistake, just ignore the tests for SymmetricExceptWith.
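The last two assumptions can be checked with a small sketch. AssumptionDemo and MethodsAgree are hypothetical names used only here. It relies on Except returning distinct items, so duplicates in the first list do not show up in the result.

```csharp
using System.Collections.Generic;
using System.Linq;

static class AssumptionDemo
{
    // True when Except and SymmetricExceptWith produce the same set of items.
    public static bool MethodsAgree(List<string> a, List<string> b)
    {
        var viaExcept = a.Except(b);               // distinct items of a not in b
        var viaSymmetric = new HashSet<string>(a); // the set drops duplicates of a
        viaSymmetric.SymmetricExceptWith(b);
        return viaSymmetric.SetEquals(viaExcept);
    }
}
```

For a = "1", "2", "2", "4", "7", "7" and b = "2", "4" (a subset), both approaches yield "1", "7" and MethodsAgree returns true. Adding "9" to b breaks the subset assumption and the method returns false.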

Two lists of 20,000 random items (test repeated 100 times and averaged, release mode).

Method                  Time [ms]
Contains *1                  49.4
Contains *2                  49.0
Except                        5.9
SymmetricExceptWith *3        4.1
SymmetricExceptWith *4        2.5

Notes:

*1 Loop with foreach.
*2 Loop with for.
*3 HashSet creation measured.
*4 HashSet creation not measured. I included this for reference, but if your first list is not already a HashSet you can't ignore creation time.
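For anyone who wants to repeat a comparison like this without the tool mentioned above, a rough Stopwatch-based sketch follows. It is not the harness used for the numbers in this answer. QuickBenchmark and its methods are hypothetical names, and the random lists here are not forced to satisfy the subset assumption. The Contains variant is expected to be slow because List<T>.Contains is a linear search, so the loop does work proportional to the product of the two list lengths.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class QuickBenchmark
{
    // Hypothetical helper: random numeric strings; the seed makes runs repeatable.
    public static List<string> RandomStrings(int count, int seed)
    {
        var random = new Random(seed);
        return Enumerable.Range(0, count)
                         .Select(_ => random.Next(count * 2).ToString())
                         .ToList();
    }

    // Average wall-clock milliseconds over the given number of runs.
    public static double AverageMs(Action action, int runs)
    {
        action(); // warm-up run so JIT compilation is not measured
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
            action();
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds / runs;
    }

    public static List<string> WithContains(List<string> a, List<string> b)
    {
        var diff = new List<string>();
        foreach (var item in a)
            if (!b.Contains(item))   // linear scan of b for every item of a
                diff.Add(item);
        return diff;
    }

    public static List<string> WithExcept(List<string> a, List<string> b)
        => a.Except(b).ToList();
}
```

A call such as QuickBenchmark.AverageMs(() => QuickBenchmark.WithExcept(a, b), 100) on two lists from RandomStrings(20000, ...) roughly mirrors the setup described above. Absolute timings depend on the machine and the runtime, so only the relative order is worth comparing.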

We see that the Contains() method is pretty slow, so we can drop it from the bigger tests (I checked anyway, and its performance does not become better or even comparable). Let's see what happens with lists of 1,000,000 items.

Method                        Time [ms]
Except                            244.4
SymmetricExceptWith               259.0

Let's try to make it parallel (please note that for this test I'm using an old Core 2 Duo 2 GHz):

Method                        Time [ms]
Except                            244.4
SymmetricExceptWith               259.0
Except (parallel partitions)      301.8
SymmetricExceptWith (p. p.)       382.6
Except (AsParallel)               274.4

Parallel performance is worse, and LINQ Except is now the best option. Let's see how it works on a better CPU (Xeon 2.8 GHz, quad core). Also note that with such a big amount of data, cache size won't affect the test too much.

Method                        Time [ms]
Except                            127.4
SymmetricExceptWith               149.2
Except (parallel partitions)      208.0
SymmetricExceptWith (p. p.)       170.0
Except (AsParallel)                80.2

To summarize: for relatively small lists SymmetricExceptWith() performs better; for big lists sequential Except() was faster than SymmetricExceptWith() in every run above. If you're targeting a modern multi-core CPU, the parallel implementation will scale much better. In code:

    // Sequential:
    var c = a.Except(b).ToList();
    // Parallel alternative (PLINQ); use one line or the other:
    var c = a.AsParallel().Except(b.AsParallel()).ToList();

Please note that if you don't need a List<string> as the result and an IEnumerable<string> is enough, performance will greatly increase (and the difference with parallel execution will be higher).

Of course those two lines of code are not optimal and can be greatly improved (and if it's really performance critical, you may take the ParallelEnumerable.Except() implementation as the starting point for your own highly optimized routine).
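As a possible sequential starting point for such a routine, here is a hand-written sketch of the hash-set idea. It is not the ParallelEnumerable.Except() implementation mentioned above, and HandRolled.Difference is a hypothetical name. It assumes the default string comparer and that the result should hold distinct items only.

```csharp
using System.Collections.Generic;

static class HandRolled
{
    // Distinct items of a that do not occur in b, in order of first appearance in a.
    public static List<string> Difference(IEnumerable<string> a, IEnumerable<string> b)
    {
        var seen = new HashSet<string>(b);   // start with everything to exclude
        var result = new List<string>();
        foreach (var item in a)
            if (seen.Add(item))              // true only for items not in b and not yet emitted
                result.Add(item);
        return result;
    }
}
```

Each item costs one hash lookup on average, so the work grows with the sum of the two list lengths instead of their product. Whether this beats Except() in practice would have to be measured.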
