
When should I use the HashSet<T> type?

I am exploring the HashSet<T> type, but I don't understand where it stands among the collections.

Can one use it to replace a List<T>? I imagine the performance of a HashSet<T> to be better, but I couldn't see any way to access its elements individually.

Is it only for enumeration?

The important thing about HashSet<T> is right there in the name: it's a set. The only things you can do with a single set are to establish what its members are, and to check whether an item is a member.

Asking if you can retrieve a single element (e.g. set[45]) is misunderstanding the concept of the set. There's no such thing as the 45th element of a set. Items in a set have no ordering. The sets {1, 2, 3} and {2, 3, 1} are identical in every respect because they have the same membership, and membership is all that matters.
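To make the "membership is all that matters" point concrete, here is a minimal sketch (written with C# 9 top-level statements; the variable names are just for illustration) showing that two sets built in a different order compare as equal, and that the only question you can really ask is whether an item is a member:

```csharp
using System;
using System.Collections.Generic;

// Two sets built from the same members, added in a different order.
var a = new HashSet<int> { 1, 2, 3 };
var b = new HashSet<int> { 2, 3, 1 };

// SetEquals compares membership only; insertion order plays no part.
bool sameMembers = a.SetEquals(b);
Console.WriteLine(sameMembers);    // True

// HashSet<T> has no indexer, so something like a[0] does not compile.
// Membership is the question a set answers:
bool hasTwo = a.Contains(2);
bool hasFortyFive = a.Contains(45);
Console.WriteLine(hasTwo);         // True
Console.WriteLine(hasFortyFive);   // False
```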

It's somewhat dangerous to iterate over a HashSet<T> because doing so imposes an order on the items in the set. That order is not really a property of the set. You should not rely on it. If ordering of the items in a collection is important to you, that collection isn't a set.

Sets are really limited: all they offer is membership, and their members are unique. On the other hand, they're really fast.

Here's a real example of where I use a HashSet<string>:

Part of my syntax highlighter for UnrealScript files is a new feature that highlights Doxygen-style comments. I need to be able to tell if a @ or \ command is valid to determine whether to show it in gray (valid) or red (invalid). I have a HashSet<string> of all the valid commands, so whenever I hit a @xxx token in the lexer, I use validCommands.Contains(tokenText) as my O(1) validity check. I really don't care about anything except existence of the command in the set of valid commands. Let's look at the alternatives I faced:

  • Dictionary<string, ?>: What type do I use for the value? The value is meaningless since I'm just going to use ContainsKey. Note: Before .NET 3.5 this was the only choice for O(1) lookups; HashSet<T> was added in 3.5 and extended to implement ISet<T> in 4.0.
  • List<string>: If I keep the list sorted, I can use BinarySearch, which is O(log n) (didn't see this fact mentioned above). However, since my list of valid commands is a fixed list that never changes, this will never be more appropriate than simply...
  • string[]: Again, Array.BinarySearch gives O(log n) performance. If the list is short, this could be the best performing option. It always has less space overhead than HashSet, Dictionary, or List. Even with BinarySearch, it's not faster for large sets, but for small sets it'd be worth experimenting. Mine has several hundred items though, so I passed on this.
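As a rough sketch of the two lookup styles compared above (not the highlighter's actual code): the command list here is a small hypothetical subset, and IsValid and IsValidBinarySearch are illustrative helper names. It is written with C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical subset of Doxygen-style commands, for illustration only.
var validCommands = new HashSet<string>(StringComparer.Ordinal)
{
    "brief", "param", "return", "see"
};

// The array alternative: must be kept sorted for BinarySearch to work.
string[] sortedCommands = { "brief", "param", "return", "see" };

// Hash lookup: close to O(1) per check.
bool IsValid(string tokenText) => validCommands.Contains(tokenText);

// Binary search over the sorted array: O(log n) per check.
bool IsValidBinarySearch(string tokenText) =>
    Array.BinarySearch(sortedCommands, tokenText, StringComparer.Ordinal) >= 0;

Console.WriteLine(IsValid("param"));              // True
Console.WriteLine(IsValid("bogus"));              // False
Console.WriteLine(IsValidBinarySearch("param"));  // True
Console.WriteLine(IsValidBinarySearch("bogus"));  // False
```

Both give the same answers; which one is faster for a given list size is something to measure rather than assume.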

A HashSet<T> implements the ICollection<T> interface:

public interface ICollection<T> : IEnumerable<T>, IEnumerable
{
    // Methods
    void Add(T item);
    void Clear();
    bool Contains(T item);
    void CopyTo(T[] array, int arrayIndex);
    bool Remove(T item);

    // Properties
    int Count { get; }
    bool IsReadOnly { get; }
}

A List<T> implements IList<T>, which extends ICollection<T>:

public interface IList<T> : ICollection<T>
{
    // Methods
    int IndexOf(T item);
    void Insert(int index, T item);
    void RemoveAt(int index);

    // Properties
    T this[int index] { get; set; }
}

A HashSet has set semantics, implemented via a hashtable internally:

A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

What does the HashSet gain, if it loses index/position/list behavior?

Adding and retrieving items from the HashSet is always by the object itself, not via an indexer, and close to an O(1) operation (List is O(1) add, O(1) retrieve by index, O(n) find/remove).
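A small side-by-side sketch of that difference, using made-up data and C# 9 top-level statements: the list is addressed by position, while the set is only ever addressed by the item itself.

```csharp
using System;
using System.Collections.Generic;

var list = new List<string> { "apple", "banana", "cherry" };
var set = new HashSet<string> { "apple", "banana", "cherry" };

// List<T>: positional access is available; lookup by value scans the list.
string second = list[1];                   // "banana"
int position = list.IndexOf("cherry");     // 2

// HashSet<T>: no indexer; everything goes through the item itself.
bool hasCherry = set.Contains("cherry");   // true
bool removed = set.Remove("banana");       // true, the item was present
bool removedAgain = set.Remove("banana");  // false, it is already gone

Console.WriteLine($"{second} {position} {hasCherry} {removed} {removedAgain} {set.Count}");
```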

A HashSet's behavior could be compared to using a Dictionary<TKey,TValue> where you only add and remove keys and ignore the dictionary values themselves. You would expect keys in a dictionary not to have duplicates, and that's the point of the "Set" part.

Performance would be a bad reason to choose HashSet over List. Instead, what better captures your intent? If order is important, then Set (or HashSet) is out. If duplicates are permitted, likewise. But there are plenty of circumstances when we don't care about order, and we'd rather not have duplicates, and that's when you want a Set.

HashSet is a set implemented by hashing. A set is a collection of values containing no duplicate elements. The values in a set are also typically unordered. So no, a set cannot be used to replace a list (unless you should've used a set in the first place).

If you're wondering what a set might be good for: anywhere you want to get rid of duplicates, obviously. As a slightly contrived example, let's say you have a list of 10,000 revisions of a software project, and you want to find out how many people contributed to that project. You could use a Set<string>, iterate over the list of revisions, and add each revision's author to the set. Once you're done iterating, the size of the set is the answer you were looking for.
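In .NET terms that could look roughly like the following, with HashSet<string> playing the role of the set. The revision data and the tuple shape are invented for illustration, and the sketch uses C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical revision log: (revision number, author) pairs.
var revisions = new List<(int Number, string Author)>
{
    (1, "alice"), (2, "bob"), (3, "alice"), (4, "carol"), (5, "bob")
};

var authors = new HashSet<string>();
foreach (var revision in revisions)
{
    // Adding an author who is already in the set changes nothing.
    authors.Add(revision.Author);
}

// The size of the set is the number of distinct contributors.
Console.WriteLine(authors.Count); // 3
```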

HashSet can be used to remove duplicate elements from an IEnumerable collection. For example,

List<string> duplicatedEnumerableStrings = new List<string> {"abc", "ghjr", "abc", "abc", "yre", "obm", "ghir", "qwrt", "abc", "vyeu"};
HashSet<string> uniqueStrings = new HashSet<string>(duplicatedEnumerableStrings);

After this code runs, uniqueStrings holds {"abc", "ghjr", "yre", "obm", "ghir", "qwrt", "vyeu"}. Note that "ghjr" and "ghir" are different strings, so both are kept.

Probably the most common use for hash sets is to see whether they contain a certain element, which is close to an O(1) operation for them (assuming a sufficiently strong hashing function), as opposed to lists, for which the check for inclusion is O(n) (and sorted sets, for which it is O(log n)). So if you do a lot of checks on whether an item is contained in some list, hash sets might be a performance improvement. If you only ever iterate over them, there won't be much difference: iterating over the whole set is O(n), same as with lists, and hash sets have somewhat more overhead when adding items.

And no, you can't index a set, which would not make sense anyway, because sets aren't ordered. If you add some items, the set won't remember which one was first, which was second, and so on.

HashSet<T> is a data structure in the .NET framework that is capable of representing a mathematical set as an object. In this case, it uses hash codes (the GetHashCode result of each item) to locate set elements, and Equals to confirm that two elements are equal.

A set differs from a list in that it only allows one occurrence of the same element contained within it. HashSet<T>.Add will just return false if you try to add a second identical element. Indeed, lookup of elements is very quick (O(1) time), since the internal data structure is simply a hashtable.
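A short sketch of that Add behavior, with invented tag values and C# 9 top-level statements. The last line assumes the default string comparer, which is case-sensitive:

```csharp
using System;
using System.Collections.Generic;

var tags = new HashSet<string>();

bool firstAdd = tags.Add("urgent");       // true: the item was added
bool secondAdd = tags.Add("urgent");      // false: already present, set unchanged

// With the default comparer, strings that differ in case are different elements.
bool differentCase = tags.Add("Urgent");  // true

Console.WriteLine($"{firstAdd} {secondAdd} {differentCase} {tags.Count}");
```

No exception is thrown for the duplicate, so the return value is the only signal that nothing was added.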

If you're wondering which to use, note that using a List<T> where HashSet<T> is appropriate is not the biggest mistake, though it may allow problems where you end up with undesirable duplicate items in your collection. What is more, lookup (item retrieval) in a HashSet<T> is vastly more efficient, ideally O(1) (for perfect bucketing) instead of O(n) time, which is quite important in many scenarios.

List<T> is used to store ordered collections of information. If you know the position of an element in the list, you can access it in constant time. However, to determine where an element lies in the list or to check if it exists in the list, the lookup time is linear. On the other hand, HashSet<T> makes no guarantees about the order of the stored data and in exchange provides constant-time lookup of its elements.

As the name implies, HashSet<T> is a data structure that implements set semantics. The data structure is optimized to implement set operations (i.e. Union, Difference, Intersect), which cannot be done as efficiently with the traditional List implementation.

So, choosing which data type to use really depends on what you are attempting to do with your application. If you don't care about how your elements are ordered in a collection, and only want to enumerate or check for existence, use HashSet<T>. Otherwise, consider using List<T> or another suitable data structure.

In the basic intended scenario, HashSet<T> should be used when you want more specific set operations on two collections than LINQ provides. LINQ methods like Distinct, Union, Intersect and Except are enough in most situations, but sometimes you may need more fine-grained operations, and HashSet<T> provides:

  • UnionWith
  • IntersectWith
  • ExceptWith
  • SymmetricExceptWith
  • Overlaps
  • IsSubsetOf
  • IsProperSubsetOf
  • IsSupersetOf
  • IsProperSupersetOf
  • SetEquals

Another difference between the LINQ and HashSet<T> "overlapping" methods is that LINQ always returns a new IEnumerable<T>, whereas the HashSet<T> methods modify the source collection.
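The difference can be seen in a small sketch like this one (example values are arbitrary; C# 9 top-level statements). Note that the comparison-style members such as IsSubsetOf, Overlaps and SetEquals only return a bool and leave the set untouched:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var evens = new HashSet<int> { 2, 4, 6, 8 };
var small = new HashSet<int> { 1, 2, 3, 4 };

// LINQ: Intersect returns a new sequence; neither source set changes.
List<int> common = evens.Intersect(small).ToList();
int countAfterLinq = evens.Count;                         // still 4

// HashSet<T>: IntersectWith modifies the set it is called on.
evens.IntersectWith(small);
int countAfterIntersectWith = evens.Count;                // now 2 (the elements 2 and 4)

// Comparison-style members return a bool and modify nothing.
bool isSubset = evens.IsSubsetOf(small);                  // true
bool isProperSubset = evens.IsProperSubsetOf(small);      // true
bool overlaps = evens.Overlaps(new[] { 4, 5 });           // true
bool sameMembers = evens.SetEquals(new[] { 4, 2 });       // true

Console.WriteLine($"{common.Count} {countAfterLinq} {countAfterIntersectWith}");
Console.WriteLine($"{isSubset} {isProperSubset} {overlaps} {sameMembers}");
```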

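In short: any time you are tempted to use a Dictionary (or a Dictionary where S is a property of T), you should consider a HashSet (or a HashSet plus implementing IEquatable on T so that equality is based on S).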
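One way to read the second half of that advice, as a sketch only: Customer is a hypothetical type whose Id stands in for "S", and equality and hashing are based on Id alone, so the set behaves like a collection keyed by Id. It uses C# 9 top-level statements, with the type declared after them as the language requires.

```csharp
using System;
using System.Collections.Generic;

var customers = new HashSet<Customer>
{
    new Customer(1, "Ada"),
    new Customer(2, "Grace"),
};

// Same Id, different Name: treated as a duplicate, so Add returns false.
bool added = customers.Add(new Customer(1, "Someone else"));

// Lookup by key: a probe object carrying only the Id is enough.
bool known = customers.Contains(new Customer(2, ""));

Console.WriteLine($"{added} {known} {customers.Count}");

// Hypothetical type: equality and hash code use Id only.
sealed class Customer : IEquatable<Customer>
{
    public Customer(int id, string name) { Id = id; Name = name; }

    public int Id { get; }
    public string Name { get; }

    public bool Equals(Customer? other) => other is not null && Id == other.Id;
    public override bool Equals(object? obj) => Equals(obj as Customer);
    public override int GetHashCode() => Id.GetHashCode();
}
```

The trade-off compared with a dictionary is that you need a probe object to ask about a key, so this fits best when you mostly add items and check for existence.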
