Is the managed heap not scalable to multi-core systems?

Question

I was seeing some strange behavior in a multithreaded application that I wrote, which was not scaling well across multiple cores.

The following code illustrates the behavior I am seeing. It appears that heap-intensive operations do not scale across multiple cores; rather, they seem to slow down, i.e. using a single thread would be faster.

using System;
using System.Threading;

class Program
{
   public static Data _threadOneData = new Data();
   public static Data _threadTwoData = new Data();
   public static Data _threadThreeData = new Data();
   public static Data _threadFourData = new Data();

   static void Main(string[] args)
   {
      // Do heap intensive tests
      var start = DateTime.Now;
      RunOneThread(WorkerUsingHeap);
      var finish = DateTime.Now;
      var timeLapse = finish - start;
      Console.WriteLine("One thread using heap: " + timeLapse);

      start = DateTime.Now;
      RunFourThreads(WorkerUsingHeap);
      finish = DateTime.Now;
      timeLapse = finish - start;
      Console.WriteLine("Four threads using heap: " + timeLapse);

      // Do stack intensive tests
      start = DateTime.Now;
      RunOneThread(WorkerUsingStack);
      finish = DateTime.Now;
      timeLapse = finish - start;
      Console.WriteLine("One thread using stack: " + timeLapse);

      start = DateTime.Now;
      RunFourThreads(WorkerUsingStack);
      finish = DateTime.Now;
      timeLapse = finish - start;
      Console.WriteLine("Four threads using stack: " + timeLapse);

      Console.ReadLine();
   }

   public static void RunOneThread(ParameterizedThreadStart worker)
   {
      var threadOne = new Thread(worker);
      threadOne.Start(_threadOneData);

      threadOne.Join();
   }

   public static void RunFourThreads(ParameterizedThreadStart worker)
   {
      var threadOne = new Thread(worker);
      threadOne.Start(_threadOneData);

      var threadTwo = new Thread(worker);
      threadTwo.Start(_threadTwoData);

      var threadThree = new Thread(worker);
      threadThree.Start(_threadThreeData);

      var threadFour = new Thread(worker);
      threadFour.Start(_threadFourData);

      threadOne.Join();
      threadTwo.Join();
      threadThree.Join();
      threadFour.Join();
   }

   static void WorkerUsingHeap(object state)
   {
      var data = state as Data;
      for (int count = 0; count < 100000000; count++)
      {
         var property = data.Property;
         data.Property = property + 1;
      }
   }

   static void WorkerUsingStack(object state)
   {
      var data = state as Data;
      double dataOnStack = data.Property;
      for (int count = 0; count < 100000000; count++)
      {
         dataOnStack++;
      }
      data.Property = dataOnStack;
   }

   public class Data
   {
      public double Property
      {
         get;
         set;
      }
   }
}

This code was run on a Core 2 Quad (4 core system) with the following results:

One thread using heap: 00:00:01.8125000
Four threads using heap: 00:00:17.7500000
One thread using stack: 00:00:00.3437500
Four threads using stack: 00:00:00.3750000

So using the heap with four threads did 4 times the work but took almost 10 times as long. This means it would be twice as fast in this case to use only one thread?

Using the stack behaved much more as expected.

I would like to know what is going on here. Can the heap only be written to from one thread at a time?

Answer 1

The answer is simple - run outside of Visual Studio...

I just copied your entire program, and ran it on my quad core system.

Inside VS (Release Build):

One thread using heap: 00:00:03.2206779
Four threads using heap: 00:00:23.1476850
One thread using stack: 00:00:00.3779622
Four threads using stack: 00:00:00.5219478

Outside VS (Release Build):

One thread using heap: 00:00:00.3899610
Four threads using heap: 00:00:00.4689531
One thread using stack: 00:00:00.1359864
Four threads using stack: 00:00:00.1409859

Note the difference. The extra time in the four-thread runs outside VS is pretty much all due to the overhead of starting the threads. Your work in this case is too small to really test, and you're not using the high-performance counters, so it's not a perfect test.

Main rule of thumb - always do perf testing outside VS, i.e. use Ctrl+F5 instead of F5 to run.
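The "high-performance counters" mentioned above are exposed in .NET through System.Diagnostics.Stopwatch. The following is only a minimal sketch of how the timing could be done with it; the `TimingSketch` class and `Measure` helper are hypothetical names, and the spin loop is a stand-in for the real worker. The debugger check simply reports when the process is being run the way this answer warns against.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class TimingSketch
{
    // Hypothetical helper: times one invocation of `work` with Stopwatch
    // instead of subtracting two DateTime.Now readings.
    public static TimeSpan Measure(Action work)
    {
        var watch = Stopwatch.StartNew();
        work();
        watch.Stop();
        return watch.Elapsed;
    }

    static void Main()
    {
        // Timings taken with a debugger attached (F5 in VS) are not representative.
        if (Debugger.IsAttached)
            Console.WriteLine("Warning: debugger attached; timings will be distorted.");

        Console.WriteLine("High-resolution timer available: " + Stopwatch.IsHighResolution);

        // Stand-in workload: start one thread, let it spin briefly, and wait for it.
        var elapsed = Measure(() =>
        {
            var thread = new Thread(() => Thread.SpinWait(1000000));
            thread.Start();
            thread.Join();
        });
        Console.WriteLine("One thread: " + elapsed);
    }
}
```

In the question's program, each pair of DateTime.Now readings could be replaced by a call such as `Measure(() => RunFourThreads(WorkerUsingHeap))`.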

Answer 2

Aside from the debug-vs-release effects, there is something more you should be aware of.

You cannot effectively evaluate multi-threaded code for performance in 0.3s.

The point of threads is two-fold: effectively model parallel work in code, and effectively exploit parallel resources (CPUs, cores).

You are trying to evaluate the latter. Given that thread start overhead is not vanishingly small in comparison to the interval over which you are timing, your measurement is immediately suspect. In most perf test trials, a significant warm-up interval is appropriate. This may sound silly to you - it's a computer program after all, not a lawnmower. But warm-up is absolutely imperative if you are really going to evaluate multi-thread performance. Caches get filled, pipelines fill up, pools get filled, GC generations get filled. The steady-state, continuous performance is what you would like to evaluate. For purposes of this exercise, the program behaves like a lawnmower.

You could say - Well, no, I don't want to evaluate the steady state performance. And if that is the case, then I would say that your scenario is very specialized. Most app scenarios, whether their designers explicitly realize it or not, need continuous, steady performance.

If you truly need the perf to be good only over a single 0.3s interval, you have found your answer. But be careful not to generalize the results.

If you want general results, you need to have reasonably long warm-up intervals, and longer collection intervals. You might start at 20s/60s for those phases, but here is the key thing: you need to vary those intervals until you find the results converging. YMMV. The valid times vary depending on the application workload and the resources dedicated to it, obviously. You may find that a measurement interval of 120s is necessary for convergence, or you may find 40s is just fine. But (a) you won't know until you measure it, and (b) you can bet 0.3s is not long enough.
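One way the warm-up and "vary the interval until the results converge" advice could be sketched is shown below. This is an illustration under assumptions, not a prescribed harness: the `SteadyStateSketch` class, its method names, the 2% tolerance and the interval-doubling strategy are all hypothetical choices, and the demo intervals are deliberately far shorter than the 20s/60s starting point suggested above so that the sketch finishes quickly.

```csharp
using System;
using System.Diagnostics;

static class SteadyStateSketch
{
    // Runs `work` repeatedly for roughly `duration` and returns iterations per second.
    // `work` should be coarse enough that reading the Stopwatch does not dominate.
    public static double Throughput(Action work, TimeSpan duration)
    {
        long iterations = 0;
        var watch = Stopwatch.StartNew();
        while (watch.Elapsed < duration)
        {
            work();
            iterations++;
        }
        return iterations / watch.Elapsed.TotalSeconds;
    }

    // Warms up, then doubles the measurement interval until two successive
    // throughput figures agree within `tolerance` (a relative fraction).
    public static double MeasureUntilStable(Action work, TimeSpan warmUp,
                                            TimeSpan initial, double tolerance, int maxRounds)
    {
        Throughput(work, warmUp); // warm-up phase; result deliberately discarded

        var interval = initial;
        double previous = Throughput(work, interval);
        for (int round = 0; round < maxRounds; round++)
        {
            interval = TimeSpan.FromTicks(interval.Ticks * 2);
            double current = Throughput(work, interval);
            if (Math.Abs(current - previous) <= tolerance * previous)
                return current;
            previous = current;
        }
        return previous; // did not converge within maxRounds; treat with suspicion
    }

    static void Main()
    {
        double sink = 0;
        // Demo values only; real runs would use much longer intervals.
        double rate = MeasureUntilStable(() => sink += 1.0,
            TimeSpan.FromMilliseconds(500), TimeSpan.FromMilliseconds(500), 0.02, 3);
        Console.WriteLine("Iterations per second: " + rate.ToString("F0"));
    }
}
```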

Answer 3

[edit] Turns out, this is a release vs. debug build issue -- not sure why it is, but it is. See comments and other answers. [/edit]

This was very interesting -- I wouldn't have guessed there'd be that much difference. (similar test machine here -- Core 2 Quad Q9300)

Here's an interesting comparison -- add a decent-sized additional element to the 'Data' class -- I changed it to this:

public class Data
{
    public double Property { get; set; }
    public byte[] Spacer = new byte[8096];
}

It's still not quite the same time, but it's very close (running it for 10x as long results in 13.1s vs. 17.6s on my machine).

If I had to guess, I'd speculate that it's related to cross-core cache coherency, at least if I'm remembering how CPU cache works. With the small version of 'Data', if a single cache line contains multiple instances of Data, the cores are having to constantly invalidate each other's caches (worst case if they're all on the same cache line). With the 'spacer' added, their memory addresses are far enough apart that one CPU's write of a given address doesn't invalidate the caches of the other CPUs.

Another thing to note -- the 4 threads start nearly concurrently, but they don't finish at the same time -- another indication that there are cross-core issues at work here. Also, I'd guess that running on a multi-CPU machine of a different architecture would bring more interesting issues to light here.

I guess the lesson from this is that in a highly-concurrent scenario, if you're doing a bunch of work with a few small data structures, you should try to make sure they aren't all packed on top of each other in memory. Of course, there's really no way to make sure of that, but I'm guessing there are techniques (like adding spacers) that could be used to try to make it happen.
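One such spacer technique can be sketched with an explicit struct layout, so that each thread's counter is padded out to its own region of memory. This is an illustration only: `PaddedCounter` and `PaddingSketch` are hypothetical names, and the 128-byte element size assumes 64-byte cache lines, which is not guaranteed on every CPU.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical padded counter: each array element occupies 128 bytes with the
// value in the middle, so two neighbouring values are always 128 bytes apart
// and cannot share one 64-byte cache line (assumed line size).
[StructLayout(LayoutKind.Explicit, Size = 128)]
struct PaddedCounter
{
    [FieldOffset(64)]
    public long Value;
}

static class PaddingSketch
{
    // Each thread increments only its own padded slot; returns the final counts.
    public static long[] CountInParallel(int threadCount, int iterations)
    {
        var counters = new PaddedCounter[threadCount];
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            int slot = i; // per-thread copy of the index for the closure
            threads[i] = new Thread(() =>
            {
                for (int n = 0; n < iterations; n++)
                    counters[slot].Value++;
            });
            threads[i].Start();
        }
        foreach (var thread in threads)
            thread.Join();

        var results = new long[threadCount];
        for (int i = 0; i < threadCount; i++)
            results[i] = counters[i].Value;
        return results;
    }

    static void Main()
    {
        foreach (long count in CountInParallel(4, 1000000))
            Console.WriteLine(count);
    }
}
```

An array of structs is used here because its elements are laid out contiguously, which makes the spacing predictable; separately allocated objects, as in the test program, end up wherever the allocator places them.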

[edit] This was too interesting -- I couldn't put it down. To test this out further, I thought I'd try varying-sized spacers, and use an integer instead of a double to keep the object without any added spacers smaller.

using System;
using System.Threading;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("name\t1 thread\t4 threads");
        RunTest("no spacer", WorkerUsingHeap, () => new Data());

        var values = new int[] { -1, 0, 4, 8, 12, 16, 20 };
        foreach (var sv in values)
        {
            var v = sv;
            RunTest(string.Format(v == -1 ? "null spacer" : "{0}B spacer", v), WorkerUsingHeap, () => new DataWithSpacer(v));
        }

        Console.ReadLine();
    }

    public static void RunTest(string name, ParameterizedThreadStart worker, Func<object> fo)
    {
        var start = DateTime.UtcNow;
        RunOneThread(worker, fo);
        var middle = DateTime.UtcNow;
        RunFourThreads(worker, fo);
        var end = DateTime.UtcNow;

        Console.WriteLine("{0}\t{1}\t{2}", name, middle-start, end-middle);
    }

    public static void RunOneThread(ParameterizedThreadStart worker, Func<object> fo)
    {
        var data = fo();
        var threadOne = new Thread(worker);
        threadOne.Start(data);

        threadOne.Join();
    }

    public static void RunFourThreads(ParameterizedThreadStart worker, Func<object> fo)
    {
        var data1 = fo();
        var data2 = fo();
        var data3 = fo();
        var data4 = fo();

        var threadOne = new Thread(worker);
        threadOne.Start(data1);

        var threadTwo = new Thread(worker);
        threadTwo.Start(data2);

        var threadThree = new Thread(worker);
        threadThree.Start(data3);

        var threadFour = new Thread(worker);
        threadFour.Start(data4);

        threadOne.Join();
        threadTwo.Join();
        threadThree.Join();
        threadFour.Join();
    }

    static void WorkerUsingHeap(object state)
    {
        var data = state as Data;
        for (int count = 0; count < 500000000; count++)
        {
            var property = data.Property;
            data.Property = property + 1;
        }
    }

    public class Data
    {
        public int Property { get; set; }
    }
    public class DataWithSpacer : Data
    {
        // -1 (any negative size) means "null spacer"; 0 allocates an empty array.
        public DataWithSpacer(int size) { Spacer = size < 0 ? null : new byte[size]; }
        public byte[] Spacer;
    }
}

Result:

name          1 thread          4 threads
no spacer     00:00:06.3480000  00:00:42.6260000
null spacer   00:00:06.2300000  00:00:36.4030000
0B spacer     00:00:06.1920000  00:00:19.8460000
4B spacer     00:00:06.1870000  00:00:07.4150000
8B spacer     00:00:06.3750000  00:00:07.1260000
12B spacer    00:00:06.3420000  00:00:07.6930000
16B spacer    00:00:06.2250000  00:00:07.5530000
20B spacer    00:00:06.2170000  00:00:07.3670000

No spacer = 1/6th the speed, null spacer = 1/5th the speed, 0B spacer = 1/3rd the speed, 4B spacer = full speed.
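The fractions above follow from dividing each four-thread time by the matching one-thread time. Every thread in the four-thread run does the same amount of work as the single thread, so perfect scaling would give a ratio of 1. A small sketch of that arithmetic, using the timings from the result table (the `SlowdownSketch` class and `Slowdown` helper are hypothetical names), gives roughly 6.7, 5.8, 3.2 and 1.2:

```csharp
using System;

static class SlowdownSketch
{
    // Wall-clock ratio of the four-thread run to the one-thread run.
    public static double Slowdown(double oneThreadSeconds, double fourThreadSeconds)
    {
        return fourThreadSeconds / oneThreadSeconds;
    }

    static void Main()
    {
        // Timings in seconds, copied from the result table.
        string[] names = { "no spacer", "null spacer", "0B spacer", "4B spacer" };
        double[] one = { 6.348, 6.230, 6.192, 6.187 };
        double[] four = { 42.626, 36.403, 19.846, 7.415 };

        for (int i = 0; i < names.Length; i++)
            Console.WriteLine("{0}: {1:F2}x slower", names[i], Slowdown(one[i], four[i]));
    }
}
```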

I don't know the full details of how the CLR allocates or aligns objects, so I can't speak to what these allocation patterns look like in real memory, but these definitely are some interesting results.

3 answers

Answer 1: score 13 (accepted), 2009-05-26 23:34:04
Answer 2: score 3, 2009-05-27 01:22:37
Answer 3: score 2, 2009-05-26 23:24:07
