
Why is ConcurrentBag<T> so slow in .NET (4.0)? Am I doing it wrong?

Before I started a project, I wrote a simple test to compare the performance of ConcurrentBag (from System.Collections.Concurrent) relative to locking & lists. I am extremely surprised that ConcurrentBag is over 10 times slower than locking with a simple List. From what I understand, the ConcurrentBag works best when the reader and writer are the same thread. However, I hadn't thought its performance would be so much worse than traditional locks.

I have run a test with two Parallel.For loops writing to and reading from a list/bag. However, the write by itself shows a huge difference:

private static void ConcurrentBagTest()
{
    int collSize = 10000000;
    Stopwatch stopWatch = new Stopwatch();
    ConcurrentBag<int> bag1 = new ConcurrentBag<int>();

    stopWatch.Start();

    Parallel.For(0, collSize, delegate(int i)
    {
        bag1.Add(i);
    });

    stopWatch.Stop();
    Console.WriteLine("Elapsed Time = {0}",
                      stopWatch.Elapsed.TotalSeconds);
}

On my box, this takes between 3-4 secs to run, compared to 0.5 - 0.9 secs of this code:

private static void LockCollTest()
{
    int collSize = 10000000;
    object list1_lock = new object();
    List<int> lst1 = new List<int>(collSize);

    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();

    Parallel.For(0, collSize, delegate(int i)
    {
        lock (list1_lock)
        {
            lst1.Add(i);
        }
    });

    stopWatch.Stop();
    Console.WriteLine("Elapsed = {0}",
                      stopWatch.Elapsed.TotalSeconds);
}

As I mentioned, doing concurrent reads and writes doesn't help the concurrent bag test. Am I doing something wrong or is this data structure just really slow?

[EDIT] - I removed the Tasks because I don't need them here (the full code had another task reading)

[EDIT] Thanks a lot for the answers. I am having a hard time picking "the right answer" since it seems to be a mix of a few answers.

As Michael Goldshteyn pointed out, the speed really depends on the data. Darin pointed out that there should be more contention for ConcurrentBag to be faster, and that Parallel.For doesn't necessarily start the same number of threads. One takeaway is not to do anything inside a lock that doesn't have to be there. In the above case, I don't see myself doing anything inside the lock except maybe assigning the value to a temp variable.
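As a rough illustration of that takeaway (a hypothetical sketch, not code from the actual project), any per-item work can be done before taking the lock so the critical section contains nothing but the Add:

// Sketch only: keep per-item work outside the lock so the critical section
// contains just the shared-state mutation.
private static void AddWithMinimalLock(List<int> target, object gate, int i)
{
    int value = i * 2;       // stand-in for whatever per-item computation is needed
    lock (gate)
    {
        target.Add(value);   // the only statement that needs mutual exclusion
    }
}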

Additionally, sixlettervariables pointed out that the number of threads that happen to be running may also affect results, although I tried running the original test in reverse order and ConcurrentBag was still slower.

I ran some tests starting 15 Tasks, and the results depended on the collection size among other things. However, ConcurrentBag performed almost as well as or better than locking a list, for up to 1 million insertions. Above 1 million, locking sometimes seemed to be much faster, but I'll probably never have a larger data structure for my project. Here's the code I ran:

int collSize = 1000000;
object list1_lock = new object();
List<int> lst1 = new List<int>();
ConcurrentBag<int> concBag = new ConcurrentBag<int>();
int numTasks = 15;

Stopwatch sWatch = new Stopwatch();
sWatch.Start();

// First, try locks. The loop counter is declared inside each task so the
// tasks don't race on a shared variable.
Task.WaitAll(Enumerable.Range(1, numTasks)
    .Select(x => Task.Factory.StartNew(() =>
    {
        for (int i = 0; i < collSize / numTasks; i++)
        {
            lock (list1_lock)
            {
                lst1.Add(x);
            }
        }
    })).ToArray());

sWatch.Stop();
Console.WriteLine("lock test. Elapsed = {0}",
    sWatch.Elapsed.TotalSeconds);

// Now try ConcurrentBag.
sWatch.Restart();
Task.WaitAll(Enumerable.Range(1, numTasks)
    .Select(x => Task.Factory.StartNew(() =>
    {
        for (int i = 0; i < collSize / numTasks; i++)
        {
            concBag.Add(x);
        }
    })).ToArray());

sWatch.Stop();
Console.WriteLine("Conc Bag test. Elapsed = {0}",
    sWatch.Elapsed.TotalSeconds);

Let me ask you this: how realistic is it that you'd have an application which is constantly adding to a collection and never reading from it? What's the use of such a collection? (This is not a purely rhetorical question. I could imagine there being uses where, e.g., you only read from the collection on shutdown (for logging) or when requested by the user. I believe these scenarios are fairly rare, though.)

This is what your code is simulating. Calling List<T>.Add is going to be lightning-fast in all but the occasional case where the list has to resize its internal array; but this is smoothed out by all the other adds that happen quite quickly. So you're not likely to see a significant amount of contention in this context, especially testing on a personal PC with, e.g., even 8 cores (as you stated you have in a comment somewhere). Maybe you might see more contention on something like a 24-core machine, where many cores can be trying to add to the list literally at the same time.

Contention is much more likely to creep in where you read from your collection, especially in foreach loops (or LINQ queries, which amount to foreach loops under the hood) that require locking the entire operation so that you aren't modifying your collection while iterating over it.
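A minimal sketch of that read-side difference, assuming the read work is just summing the elements (the method names are mine, not from the question's code):

// With a plain List<int>, the whole enumeration has to hold the lock so that
// writers cannot mutate the list mid-iteration.
private static long SumWithLock(List<int> list, object gate)
{
    long sum = 0;
    lock (gate)
    {
        foreach (int x in list)
        {
            sum += x;   // writers are blocked for the entire loop
        }
    }
    return sum;
}

// A ConcurrentBag<int> can be enumerated without an external lock; the
// enumerator works over a snapshot, so concurrent Adds do not invalidate it.
private static long SumBag(ConcurrentBag<int> bag)
{
    long sum = 0;
    foreach (int x in bag)
    {
        sum += x;
    }
    return sum;
}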

If you can realistically reproduce this scenario, I believe you will see ConcurrentBag<T> scale much better than your current test is showing.


Update: Here is a program I wrote to compare these collections in the scenario I described above (multiple writers, many readers). Running 25 trials with a collection size of 10000 and 8 reader threads, I got the following results:

Took 529.0095 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 39.5237 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
Took 309.4475 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 81.1967 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
Took 228.7669 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 164.8376 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
[ ... ]
Average list time: 176.072456 ms.
Average bag time: 59.603656 ms.

So clearly it depends on exactly what you're doing with these collections.

There seems to be a bug in the .NET Framework 4 that Microsoft fixed in 4.5; it seems they didn't expect ConcurrentBag to be used a lot.

See the following Ayende post for more info:

http://ayende.com/blog/156097/the-high-cost-of-concurrentbag-in-net-4-0

As a general answer:

  • Concurrent collections that use locking can be very fast if there is little or no contention for their data (i.e., for their locks). This is because such collection classes are often built with very inexpensive locking primitives, which are especially cheap when uncontended.
  • Lock-free collections can be slower because of the tricks used to avoid locks and because of other bottlenecks, such as false sharing and the complexity required to implement their lock-free nature, which can lead to cache misses.

To summarize, which way is faster depends heavily on the data structures employed and on the amount of contention for the locks, among other issues (e.g., the number of readers vs. writers in a shared/exclusive arrangement).

Your particular example has a very high degree of contention, so I must say I am surprised by the behavior. On the other hand, the amount of work done while the lock is held is very small, so maybe there is little contention for the lock itself after all. There could also be deficiencies in the implementation of ConcurrentBag's concurrency handling, which makes your particular example (frequent inserts and no reads) a bad use case for it.

Looking at the program using MS's contention visualizer shows that ConcurrentBag<T> has a much higher cost associated with parallel insertion than simply locking on a List<T>. One thing I noticed is that there appears to be a cost associated with spinning up the 6 threads (used on my machine) to begin the first ConcurrentBag<T> run (cold run). 5 or 6 threads are then reused for the List<T> code, which is faster (warm run). Adding another ConcurrentBag<T> run after the list shows it takes less time than the first (warm run).

From what I'm seeing in the contention, a lot of time is spent in the ConcurrentBag<T> implementation allocating memory. Removing the explicit capacity allocation from the List<T> code slows it down, but not enough to make a difference.

EDIT: it appears that ConcurrentBag<T> internally keeps a list per Thread.CurrentThread, locks 2-4 times depending on whether it is running on a new thread, and performs at least one Interlocked.Exchange. As noted in MSDN, it is "optimized for scenarios where the same thread will be both producing and consuming data stored in the bag." This is the most likely explanation for your performance decrease versus a raw list.
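A toy sketch of that design, just to show why same-thread adds avoid contention (this is not the real ConcurrentBag<T> implementation, which also supports stealing work between the per-thread lists):

using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Toy illustration only: each thread adds to its own list, so Add needs no
// cross-thread synchronization; only the cross-thread read takes a lock.
class PerThreadListBag<T>
{
    private readonly ThreadLocal<List<T>> local;
    private readonly List<List<T>> allLists = new List<List<T>>();
    private readonly object gate = new object();

    public PerThreadListBag()
    {
        local = new ThreadLocal<List<T>>(() =>
        {
            var list = new List<T>();
            lock (gate) { allLists.Add(list); }  // one registration per thread
            return list;
        });
    }

    public void Add(T item)
    {
        local.Value.Add(item);  // same-thread add: uncontended
    }

    public List<T> Snapshot()
    {
        // A real implementation needs more care here (per-thread lists may
        // still be growing); the lock only protects the list of lists.
        lock (gate)
        {
            return allLists.SelectMany(l => l).ToList();
        }
    }
}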

This has already been resolved in .NET 4.5. The underlying issue was that ThreadLocal, which ConcurrentBag uses, didn't expect to have a lot of instances. That has been fixed, and it can now run fairly fast.

source - The HIGH cost of ConcurrentBag in .NET 4.0

As @Darin-Dimitrov said, I suspect that your Parallel.For isn't actually spawning the same number of threads in each of the two tests. Try manually creating N threads to ensure that you are actually seeing thread contention in both cases.
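For example, a minimal sketch of driving the bag with an explicit, fixed number of writer threads (the count is arbitrary; the equivalent List<T> version would just wrap the Add in a lock):

// Sketch: a fixed number of writer threads, so both tests see the same
// degree of contention regardless of how Parallel.For schedules work.
private static void FillBagWithThreads(ConcurrentBag<int> bag, int totalItems, int threadCount)
{
    var threads = new Thread[threadCount];
    int perThread = totalItems / threadCount;

    for (int t = 0; t < threadCount; t++)
    {
        threads[t] = new Thread(() =>
        {
            for (int i = 0; i < perThread; i++)
            {
                bag.Add(i);
            }
        });
        threads[t].Start();
    }

    foreach (Thread thread in threads)
    {
        thread.Join();   // wait for every writer before stopping the stopwatch
    }
}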

You basically have very few concurrent writes and no contention (Parallel.For doesn't necessarily mean many threads). Try parallelizing the writes and you will observe different results:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    private static object list1_lock = new object();
    private const int collSize = 1000;

    static void Main()
    {
        ConcurrentBagTest();
        LockCollTest();
    }

    private static void ConcurrentBagTest()
    {
        var bag1 = new ConcurrentBag<int>();
        var stopWatch = Stopwatch.StartNew();
        Task.WaitAll(Enumerable.Range(1, collSize).Select(x => Task.Factory.StartNew(() =>
        {
            Thread.Sleep(5);
            bag1.Add(x);
        })).ToArray());
        stopWatch.Stop();
        Console.WriteLine("Elapsed Time = {0}", stopWatch.Elapsed.TotalSeconds);
    }

    private static void LockCollTest()
    {
        var lst1 = new List<int>(collSize);
        var stopWatch = Stopwatch.StartNew();
        Task.WaitAll(Enumerable.Range(1, collSize).Select(x => Task.Factory.StartNew(() =>
        {
            lock (list1_lock)
            {
                Thread.Sleep(5);
                lst1.Add(x);
            }
        })).ToArray());
        stopWatch.Stop();
        Console.WriteLine("Elapsed = {0}", stopWatch.Elapsed.TotalSeconds);
    }
}

My guess is that the locks don't experience much contention. I would recommend reading the following article: Java theory and practice: Anatomy of a flawed microbenchmark. The article discusses a lock microbenchmark. As stated in the article, there are a lot of things to take into consideration in this kind of situation.

It would be interesting to see scaling between the two of them.

Two questions:

1) How fast is the bag vs. the list for reading? Remember to put a lock on the list.

2) How fast is the bag vs. the list for reading while another thread is writing?

Because the loop body is small, you could try using the Partitioner class's Create method...

...which enables you to provide a sequential loop for the delegate body, so that the delegate is invoked only once per partition, instead of once per iteration.

How to: Speed Up Small Loop Bodies
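Roughly along the lines of the following sketch (adapted from the approach in the linked article, reusing bag1 and collSize from the question's code):

// Sketch: range partitioning, so each delegate invocation handles a whole
// chunk of indices sequentially instead of a single index per invocation.
var rangePartitioner = Partitioner.Create(0, collSize);

Parallel.ForEach(rangePartitioner, range =>
{
    // range.Item1 is the inclusive start, range.Item2 the exclusive end.
    for (int i = range.Item1; i < range.Item2; i++)
    {
        bag1.Add(i);
    }
});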

It appears that ConcurrentBag is just slower than the other concurrent collections.

I think it's an implementation problem: ANTS Profiler shows that it gets bogged down in a couple of places, including an array copy.

Using ConcurrentDictionary is thousands of times faster.
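A minimal sketch of what that substitution might look like, assuming the loop index can serve as a unique key and only the stored values matter:

// Sketch: ConcurrentDictionary as a stand-in for the bag; the loop index is
// used as a unique key purely so that TryAdd always succeeds.
var dict = new ConcurrentDictionary<int, int>();

Parallel.For(0, collSize, i =>
{
    dict.TryAdd(i, i);
});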
