Create a HashSet with a large number of elements (1M)

I have to create a HashSet with the elements from 1 to N+1, where N is a large number (1M).

For example, if N = 5, the HashSet will then contain the integers {1, 2, 3, 4, 5, 6}.

The only way I have found is:

HashSet<int> numbers = new HashSet<int>(N);

for (int i = 1; i <= (N + 1) ; i++)
{
    numbers.Add(i);
}

Is there a faster (more efficient) way to do it?

6 is a tiny number of items, so I suspect the real problem is adding a few thousand items. The delays in this case are caused by buffer reallocations, not by the speed of Add itself.

The solution is to specify a capacity, even an approximate one, when constructing the HashSet:

var set = new HashSet<int>(1000);

If, and only if, the input implements ICollection<T>, the HashSet<T>(IEnumerable<T>) constructor will check the size of the input collection and use it as its capacity:

if (collection is ICollection<T> coll)
{
    int count = coll.Count;
    if (count > 0)
    {
        Initialize(count);
    }
}
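As a minimal sketch of that constructor path: an int[] implements ICollection<int>, so materializing the range into an array first lets the constructor read Count before adding anything. The class and method names here (PresizedSetExample, FromRange) are hypothetical, and the extra array costs an allocation of its own, so this illustrates the mechanism rather than a measured speedup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PresizedSetExample
{
    // Hypothetical helper: builds the set {1, ..., n + 1}.
    public static HashSet<int> FromRange(int n)
    {
        // int[] implements ICollection<int>, so the HashSet constructor
        // can see values.Length (via Count) and size itself up front.
        int[] values = Enumerable.Range(1, n + 1).ToArray();
        return new HashSet<int>(values);
    }

    public static void Main()
    {
        HashSet<int> set = FromRange(5);
        // Sort before printing so the output does not depend on set ordering.
        Console.WriteLine(string.Join(", ", set.OrderBy(x => x))); // 1, 2, 3, 4, 5, 6
    }
}
```

When the values only need to be generated once, passing an explicit capacity and calling Add in a loop avoids the intermediate array.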

Explanation

Most containers in .NET use buffers internally to store data. This is far faster than implementing containers using pointers, nodes, etc., due to CPU cache and RAM access delays. Accessing the next item in the CPU's cache is far faster than chasing a pointer in RAM on any CPU.

The downside is that each time the buffer is full, a new one has to be allocated. Typically, the new buffer is twice the size of the original. Adding items one by one can therefore result in about log2(N) reallocations. This works fine for a moderate number of items but can leave a lot of orphaned buffers when adding, e.g., 1000 items one by one. All those temporary buffers will have to be garbage collected at some point, causing additional delays.
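The doubling argument can be checked with a few lines of arithmetic. The sketch below is a simplified model, not HashSet<T>'s actual growth policy (the real implementation picks prime sizes); the names GrowthModel and CountReallocations and the starting capacity of 4 are assumptions made for illustration.

```csharp
using System;

public static class GrowthModel
{
    // Simplified model: a buffer starts at initialCapacity and doubles
    // whenever it is too small. Returns how many reallocations are needed
    // before it can hold itemCount items.
    public static int CountReallocations(int itemCount, int initialCapacity)
    {
        int reallocations = 0;
        long capacity = initialCapacity; // long avoids overflow while doubling
        while (capacity < itemCount)
        {
            capacity *= 2;
            reallocations++;
        }
        return reallocations;
    }

    public static void Main()
    {
        // Assumed starting capacity of 4 for the model.
        Console.WriteLine(CountReallocations(1000, 4));           // 8
        Console.WriteLine(CountReallocations(1_000_001, 4));      // 18
        // Preallocating the full size means no reallocation at all.
        Console.WriteLine(CountReallocations(1_000_001, 1_000_001)); // 0
    }
}
```

Each of those reallocations also copies every existing entry into the new buffer and leaves the old buffer behind for the garbage collector.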

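Here's the code to test the three options:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

var N = 1000000;
var trials = new List<(int method, TimeSpan duration)>();

for (var i = 0; i < 100; i++)
{
    // Option 1: pass Enumerable.Range straight to the constructor.
    var sw = Stopwatch.StartNew();
    HashSet<int> numbers1 = new HashSet<int>(Enumerable.Range(1, N + 1));
    sw.Stop();
    trials.Add((1, sw.Elapsed));

    // Option 2: preallocate, then add 1..N+1 with a traditional loop.
    sw = Stopwatch.StartNew();
    HashSet<int> numbers2 = new HashSet<int>(N);
    for (int n = 1; n <= N + 1; n++)
        numbers2.Add(n);
    sw.Stop();
    trials.Add((2, sw.Elapsed));

    // Option 3: preallocate, then add with foreach over Enumerable.Range.
    // The stopwatch must be restarted here; a stopped Stopwatch keeps
    // reporting the previous Elapsed value.
    sw = Stopwatch.StartNew();
    HashSet<int> numbers3 = new HashSet<int>(N);
    foreach (int n in Enumerable.Range(1, N + 1))
        numbers3.Add(n);
    sw.Stop();
    trials.Add((3, sw.Elapsed));
}

// Average duration per option, in milliseconds.
for (int j = 1; j <= 3; j++)
    Console.WriteLine(trials.Where(x => x.method == j).Average(x => x.duration.TotalMilliseconds));

Typical output is this:

31.314788
16.493208
16.493208

Preallocating the capacity of the HashSet<int> and adding the items in a loop is nearly twice as fast as passing Enumerable.Range to the constructor.

The second and third figures are identical to the last digit, which is what a stopwatch that was not restarted before the third option produces. They therefore do not show whether the traditional loop and the LINQ foreach option differ; with the restart shown in the code, the third line measures the foreach option on its own.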

To build on @Enigmativity's answer, here's a proper benchmark using BenchmarkDotNet:

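using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Benchmark
{
    private const int N = 1000000;

    // Lets the constructor consume the LINQ range directly.
    [Benchmark]
    public HashSet<int> EnumerableRange() => new HashSet<int>(Enumerable.Range(1, N + 1));

    // Default-constructed set: the internal buffers grow as items are added.
    [Benchmark]
    public HashSet<int> NoPreallocation()
    {
        var result = new HashSet<int>();
        // 1..N+1 inclusive, the same values as Enumerable.Range(1, N + 1).
        for (int n = 1; n <= N + 1; n++)
        {
            result.Add(n);
        }
        return result;
    }

    // Capacity passed up front, so the loop should not trigger a resize.
    [Benchmark]
    public HashSet<int> Preallocation()
    {
        var result = new HashSet<int>(N);
        for (int n = 1; n <= N + 1; n++)
        {
            result.Add(n);
        }
        return result;
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        // Runs every [Benchmark] method found in this assembly.
        BenchmarkRunner.Run(typeof(Program).Assembly);
    }
}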

With the results:

| Method          | Mean     | Error    | StdDev   |
|-----------------|----------|----------|----------|
| EnumerableRange | 29.17 ms | 0.743 ms | 2.179 ms |
| NoPreallocation | 23.96 ms | 0.471 ms | 0.775 ms |
| Preallocation   | 11.68 ms | 0.233 ms | 0.665 ms |

As we can see, using LINQ is a bit slower than not using LINQ (as expected), and preallocating saves a significant amount of time: roughly half the runtime compared with no preallocation.
