Fast way to check if IEnumerable<T> contains no duplicates (= is distinct)

Is there a fast built-in way to check if an IEnumerable<string> contains only distinct strings?

In the beginning I started with:

// `source` is the IEnumerable<string> being checked (`enum` is a reserved word in C#)
var enumAsArray = source.ToArray();
if (enumAsArray.Length != enumAsArray.Distinct().Count())
    throw ...

However, this looks like it is O(2n) - is it? ToArray() might be O(1)?

This looks faster:

var set = new HashSet<string>();
foreach (var str in source)
{
    if (!set.Add(str))
        throw ...
}

This should be O(n). However, is there a built-in way too?

Edit: Maybe Distinct() uses this internally?


Solution: After considering all the comments and the answer, I wrote an extension method for my second solution, as this seems to be the fastest version and the most readable too:

public static bool ContainsDuplicates<T>(this IEnumerable<T> e)
{
    var set = new HashSet<T>();
    // ReSharper disable LoopCanBeConvertedToQuery
    foreach (var item in e)
    // ReSharper restore LoopCanBeConvertedToQuery
    {
        if (!set.Add(item))
            return true;
    }
    return false;
}

Your second code sample is short, simple, clearly effective, and if not the perfect solution, clearly rather close to it. It seems like a perfectly acceptable solution to your particular problem.

Unless performance testing shows that this particular solution is causing problems, I'd leave it as is. Given how little room for improvement I can see, that doesn't seem likely. The solution is not lengthy or complex enough that finding something "shorter" or more concise would be worth your time and effort.

In short, there are almost certainly better places in your code to spend your time; what you have already is fine.

To answer your specific questions:

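  1. However, this looks like it is O(2n) - is it?

    Yes, it makes two passes: ToArray() copies the sequence and Distinct().Count() walks the copy. Strictly speaking, O(2n) is the same as O(n), since big-O notation drops constant factors.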

  2. ToArray() might be O(1)?

    No, it's not. It has to copy every element into a new array, so it is O(n).

  3. Maybe Distinct() uses this internally?

    It does use a HashSet, and it looks pretty similar, but it simply skips duplicate items; it doesn't give the caller any indication that it has just passed over a duplicate. As a result, you need to iterate the whole sequence twice to see whether it removed anything, rather than stopping when the first duplicate is encountered. This is the difference between something that always iterates the full sequence twice and something that iterates the full sequence at most once and can short-circuit as soon as it knows the answer.

  4. Is there a built-in way too?

    Well, you showed one; it's just not as efficient. I can think of no purely LINQ-based solution as efficient as what you showed. One tempting candidate, data.Except(data).Any(), does not work: Except is a set difference, so subtracting a sequence from itself always yields an empty result. A LINQ form that does work, such as data.GroupBy(x => x).Any(g => g.Count() > 1), has to consume the whole sequence before it can report anything, so it is still worse than your non-LINQ solution and not worth using.
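The difference described in answers 1 and 3 can be made visible by counting how many items each approach actually reads. The following is only a rough sketch: `Counted`, `ItemsReadByHashSetLoop` and `ItemsTouchedByDistinctCount` are hypothetical helper names invented for this illustration, and the input data is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ShortCircuitDemo
{
    // Hypothetical helper: passes items through and calls onItem for each one pulled.
    public static IEnumerable<T> Counted<T>(IEnumerable<T> source, Action onItem)
    {
        foreach (var item in source)
        {
            onItem();
            yield return item;
        }
    }

    // The HashSet loop from the question: returns true at the first repeated item.
    public static bool ContainsDuplicates<T>(IEnumerable<T> source)
    {
        var seen = new HashSet<T>();
        foreach (var item in source)
        {
            if (!seen.Add(item))
                return true;
        }
        return false;
    }

    // Number of items the HashSet loop reads before it can answer.
    public static int ItemsReadByHashSetLoop<T>(IEnumerable<T> source)
    {
        int reads = 0;
        ContainsDuplicates(Counted(source, () => reads++));
        return reads;
    }

    // Number of items touched by ToArray() followed by Distinct().Count().
    public static int ItemsTouchedByDistinctCount<T>(IEnumerable<T> source)
    {
        int touched = 0;
        // First pass: copy the whole source into an array.
        var array = Counted(source, () => touched++).ToArray();
        // Second pass: Distinct().Count() walks the whole array copy.
        int distinctCount = Counted(array, () => touched++).Distinct().Count();
        _ = array.Length != distinctCount; // the actual duplicate check
        return touched;
    }

    public static void Main()
    {
        var data = new[] { "a", "a", "b", "c", "d" };
        Console.WriteLine(ItemsReadByHashSetLoop(data));      // 2
        Console.WriteLine(ItemsTouchedByDistinctCount(data)); // 10
    }
}
```

With the duplicate near the front, the loop stops after two items, while the count-based check always touches every item twice. When the sequence has no duplicates, the loop reads each item exactly once.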

Here is a possible refinement to the OP's answer:

public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> e)
{
    var set = new HashSet<T>();
    // ReSharper disable LoopCanBeConvertedToQuery
    foreach (var item in e)
    // ReSharper restore LoopCanBeConvertedToQuery
    {
        if (!set.Add(item))
            yield return item;
    }
}

You now have a potentially useful method to get the actual duplicate items, and you can answer your original question with:

collection.Duplicates().Any()
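One detail worth knowing is that Duplicates() yields an item every time it recurs, so a value that appears three times is returned twice. A small sketch of what that looks like in practice, assuming the method lives in a static class (called `DuplicateExtensions` here purely for illustration) and using made-up sample data:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateExtensions
{
    // Same logic as the method above: yields each item that was already seen.
    public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> e)
    {
        var set = new HashSet<T>();
        foreach (var item in e)
        {
            if (!set.Add(item))
                yield return item;
        }
    }
}

public static class DuplicatesDemo
{
    public static void Main()
    {
        var words = new[] { "a", "b", "a", "a", "c", "b" };

        // Every repeated occurrence is reported.
        Console.WriteLine(string.Join(", ", words.Duplicates()));            // a, a, b

        // Distinct() collapses them to the set of values that occur more than once.
        Console.WriteLine(string.Join(", ", words.Duplicates().Distinct())); // a, b

        // Any() stops enumerating as soon as the first duplicate is yielded.
        Console.WriteLine(words.Duplicates().Any());                         // True
    }
}
```

Because the method is an iterator, nothing is enumerated until the result is consumed, and Any() keeps the short-circuit behaviour of the original loop.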

Just a complement to the existing solution:

public static bool ContainsDuplicates<T>(this IEnumerable<T> items)
{
    return ContainsDuplicates(items, EqualityComparer<T>.Default);
}

public static bool ContainsDuplicates<T>(this IEnumerable<T> items, IEqualityComparer<T> equalityComparer)
{
    var set = new HashSet<T>(equalityComparer);

    foreach (var item in items)
    {
        if (!set.Add(item))
            return true;
    }

    return false;
}

This version lets you pick an equality comparer, which may prove useful if you want to compare items based on non-default rules.

For instance, to compare a set of strings case-insensitively, just pass it StringComparer.OrdinalIgnoreCase.
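As a quick illustration of the comparer overload, here is a sketch that runs the same input through both overloads. The class names `DuplicateCheckExtensions` and `ComparerDemo` and the sample names are assumptions made for this example only.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateCheckExtensions
{
    // Uses the default equality rules for T.
    public static bool ContainsDuplicates<T>(this IEnumerable<T> items)
    {
        return ContainsDuplicates(items, EqualityComparer<T>.Default);
    }

    // Uses the caller-supplied comparer to decide when two items are equal.
    public static bool ContainsDuplicates<T>(this IEnumerable<T> items, IEqualityComparer<T> equalityComparer)
    {
        var set = new HashSet<T>(equalityComparer);

        foreach (var item in items)
        {
            if (!set.Add(item))
                return true;
        }

        return false;
    }
}

public static class ComparerDemo
{
    public static void Main()
    {
        var names = new[] { "Alice", "bob", "ALICE" };

        // Default string equality is case-sensitive, so no duplicates are found.
        Console.WriteLine(names.ContainsDuplicates());                                 // False

        // Ignoring case, "Alice" and "ALICE" are treated as the same string.
        Console.WriteLine(names.ContainsDuplicates(StringComparer.OrdinalIgnoreCase)); // True
    }
}
```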
