
Picking 'any' item from a HashSet is very slow - how to do this fast?

I'm doing a lot of work at the moment with greedy algorithms - they don't care about indexing etc, they only work on groups/sets. But I'm finding that 85% of my execution time is spent trying to pick an item out of a HashSet.

According to MSDN docs:

The HashSet class provides high-performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

...and that appears to be true for everything EXCEPT "fetching a value".

I've tried:

  • ElementAt(0) - extremely slow, allegedly because HashSet goes and generates an ad-hoc ordering, sorts everything, then returns whatever was first
  • First - extremely slow (presumably it's doing the same thing)

What I expected:

  • AnyItem() - returns an item from the HashSet, with no guarantees. Could be first, could be last, could be random (but don't count on that).

...but I can't seem to figure out a way of doing that.

EDIT2: slightly more detailed source code, to make it clear why some trivial workarounds for this missing feature in HashSet don't help, and hopefully to show better why HashSet is the right class in all other ways:

HashSet<MyClass> candidates;
HashSet<MyClass> unvisited;

... // fill unvisited with typically: 100,000 to 10 million items
... // fill candidates with at least 1 item, potentially 100,000's of items

while( candidates.Count > 0 && unvisited.Count > 0 )
{
  var anyItem = candidates.First();

  while( ! CanProcess( anyItem ) ) // CanProcess probably checks some set intersections
  {
     candidates.Remove( anyItem );
     if( candidates.Count > 0 )
        anyItem = candidates.First();
     else
     {
        anyItem = null;
        break;
     }
  }

  if( anyItem == null ) // we've run out of candidates
     break;

  // For the algorithm: "processing" anyItem has a side-effect of 
  // transferring 0 or more items from "unvisited" into "candidates"
  var extraCandidates = Process( anyItem, unvisited );
  // ... Process probably does some set intersections
  
  ... // add all the extraCandidates to candidates
  ... // remove all the extraCandidates from unvisited
  
  candidates.Remove( anyItem );
}

ie: Typical greedy algorithm that has several sets: one set of "starting points for the next iteration", and one or more sets (here I've only shown one) of "data that hasn't been processed yet, and is somehow connected to / reachable from the starting points".

...everything there is fast; the only thing that's slow is the "First" call - and I have no reason to take the first, I could take any, but I need to take something!

It seems that the HashSet class is not optimized for the scenario where its first item is repeatedly removed. The space reserved internally by this class is not reduced after each removal; instead the corresponding slot is marked as empty. The enumerator of the class enumerates all the internal slots, and yields a value whenever it finds a non-empty slot. This can become extremely inefficient when the internal space of a HashSet has become sparsely populated. For example a HashSet that once held 1,000,000 elements and has been reduced to a single element must enumerate 1,000,000 slots before yielding the element stored in its single non-empty slot:

var set = new HashSet<int>(Enumerable.Range(1, 1_000_000));
set.ExceptWith(Enumerable.Range(1, 999_999));
var item = set.First(); // Very slow

This is a problem that is not easy to solve. One solution is to call the TrimExcess method after each batch of deletions. This method minimizes the space reserved internally by the class by allocating a new array of slots, copying the items from the existing array to the new one, and finally discarding the old array. This is an expensive operation, so calling TrimExcess too frequently could become the new bottleneck of your app.
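As a sketch of that approach (the element counts here are arbitrary, for illustration only):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TrimExcessDemo
{
    static void Main()
    {
        var set = new HashSet<int>(Enumerable.Range(1, 1_000_000));

        // A large batch of removals leaves the internal slot array sparse...
        set.ExceptWith(Enumerable.Range(1, 999_999));

        // ...so compact it once per batch, not once per removal.
        set.TrimExcess(); // reallocates internal storage sized to Count

        Console.WriteLine(set.First()); // fast again: only 1 slot to scan
    }
}
```

After the TrimExcess call, First() scans a dense array of one slot instead of a million mostly-empty ones.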

Another solution could be to use a third-party implementation that doesn't suffer from this problem. For example the Rock.Collections library contains an OrderedHashSet class that keeps the items in the order in which they are added. It achieves this by connecting the internal slots in a linked-list manner. The class can be enumerated not only in normal but also in reversed order. I haven't tested it, but most probably calling First should be an O(1) operation.
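The linked-list idea is easy to sketch without the third-party dependency. Below is a minimal, hypothetical OrderedSet<T> (not the Rock.Collections API) that pairs a Dictionary with a LinkedList, so fetching "any" item stays O(1) no matter how many items have been removed:

```csharp
using System;
using System.Collections.Generic;

// Sketch: Dictionary gives O(1) membership tests; the LinkedList preserves
// insertion order, and because we store each item's node we can also remove
// from the list in O(1).
class OrderedSet<T>
{
    private readonly Dictionary<T, LinkedListNode<T>> _map =
        new Dictionary<T, LinkedListNode<T>>();
    private readonly LinkedList<T> _order = new LinkedList<T>();

    public int Count => _map.Count;

    public bool Add(T item)
    {
        if (_map.ContainsKey(item)) return false;
        _map[item] = _order.AddLast(item);
        return true;
    }

    public bool Remove(T item)
    {
        if (!_map.TryGetValue(item, out var node)) return false;
        _order.Remove(node); // O(1): we hold the node itself, no search needed
        _map.Remove(item);
        return true;
    }

    // O(1) "any item", unlike HashSet<T>.First() on a sparsely populated set.
    public T AnyItem()
    {
        if (_order.First == null)
            throw new InvalidOperationException("The set is empty.");
        return _order.First.Value;
    }
}
```

The trade-off is a constant per-item overhead (one LinkedListNode allocation per Add), which is usually a good deal when AnyItem is called in a hot loop.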

Below is a solution that uses reflection to trick the built-in enumerator into starting the enumeration of the slots from a random index, instead of index 0. It offers acceptable performance, but it suffers from the known problems of reflection (overhead, forward compatibility etc). The static GetRandom method is located inside a generic static class, in order to cache the FieldInfo information separately for each type T.

public static class HashSetRandomizer<T>
{
    private static FieldInfo _lastIndexField;
    private static FieldInfo _indexField;
    private static ThreadLocal<Random> _random;

    static HashSetRandomizer()
    {
        const BindingFlags FLAGS = BindingFlags.NonPublic | BindingFlags.Instance;
        _lastIndexField = typeof(HashSet<T>).GetField("m_lastIndex", FLAGS) // Framework
            ?? typeof(HashSet<T>).GetField("_lastIndex", FLAGS); // Core
        _indexField = typeof(HashSet<T>.Enumerator).GetField("index", FLAGS) // Framework
            ?? typeof(HashSet<T>.Enumerator).GetField("_index", FLAGS); // Core
        _random = new ThreadLocal<Random>(() => new Random());
    }

    public static T GetRandom(HashSet<T> source, Random random = null)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        random = random ?? _random.Value;
        if (_lastIndexField == null)
            throw new NotSupportedException("FieldInfo lastIndex not found.");
        if (_indexField == null)
            throw new NotSupportedException("FieldInfo index not found.");
        if (source.Count > 0)
        {
            int lastIndex = (int)_lastIndexField.GetValue(source);
            if (lastIndex > 0)
            {
                var randomIndex = random.Next(0, lastIndex);
                using (var enumerator = source.GetEnumerator())
                {
                    _indexField.SetValue(enumerator, randomIndex);
                    if (enumerator.MoveNext()) return enumerator.Current;
                }
            }
            foreach (var item in source) return item; // Fallback
        }
        throw new InvalidOperationException("The source sequence is empty.");
    }
}

Usage example. Items are removed randomly from a HashSet, until the set is empty.

var set = new HashSet<int>(Enumerable.Range(1, 1_000_000));
while (set.Count > 0)
{
    var item = HashSetRandomizer<int>.GetRandom(set); // Fast
    set.Remove(item);
}

Removing the last few items is still quite slow, even with this approach.

Using First() all the time means the internal Enumerator struct is built on every call. But since it is not important which element you get, you can obtain the IEnumerator just once and then keep reading data from it. So it is basically a normal foreach loop over the HashSet entries you have to work with.

To prevent any "Collection was modified" exceptions you must not remove the processed entry from the HashSet until your iteration is complete. So you can save the entries which have been processed and delete them afterwards. The source code might look like this:

HashSet<MyClass> hs; // approx 500,000 items

while( /* metadata based on what's been processed */ ) // might be adjusted now
{
    HashSet<MyClass> toDelete = new HashSet<MyClass>();
    foreach (MyClass entry in hs) // get Enumerator only once, then iterate normally
    {
        if (ShouldProcess(entry))
        {
            Process(entry);
            toDelete.Add(entry);
        }
    }
    // finally delete them
    foreach (MyClass entry in toDelete)
    {
        hs.Remove(entry);
    }
}

Because you iterate over the whole HashSet rather than running your "metadata" check after each entry, you might need to adjust your outer while loop: in one outer iteration the whole HashSet is traversed, and entries aren't deleted immediately.
