
Understanding VS2010 C# parallel profiling results

I have a program with many independent computations, so I decided to parallelize it.

I use Parallel.For/Each.

The results were okay on a dual-core machine - CPU utilization of about 80%-90% most of the time. However, on a dual-Xeon machine (i.e. 8 cores) I get only about 30%-40% CPU utilization, although the program spends quite a lot of time (sometimes more than 10 seconds) in the parallel sections, and I see it employs about 20-30 more threads in those sections compared to the serial sections. Each thread takes more than 1 second to complete, so I see no reason for them not to work in parallel - unless there is a synchronization problem.
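For context, the parallelization approach described above can be sketched as below; the squaring workload is just an illustrative stand-in for the program's actual independent computations.

```csharp
using System;
using System.Threading.Tasks;

// Minimal sketch: independent iterations handed to Parallel.For.
// The squaring here is a stand-in for the real per-item computation.
class ParallelSketch
{
    public static int[] Squares(int n)
    {
        var results = new int[n];
        Parallel.For(0, n, i =>
        {
            results[i] = i * i; // each iteration touches only its own slot
        });
        return results;
    }

    static void Main()
    {
        Console.WriteLine(Squares(100)[10]); // 100
    }
}
```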

I used the built-in profiler of VS2010, and the results are strange. Even though I use locks in only one place, the profiler reports that about 85% of the program's time is spent on synchronization (plus 5-7% sleep, 5-7% execution, and under 1% I/O).

The locked code is only a cache (a dictionary) get/add:

bool esn_found;
lock (lock_load_esn)
    esn_found = cache.TryGetValue(st, out esn);
if (!esn_found)
{
    // Cache miss: load the entry outside the lock...
    esn = pData.esa_inv_idx.esa[term_idx];
    esn.populate(pData.esa_inv_idx.datafile);
    // ...then re-check under the lock before adding, since another
    // thread may have added the same key in the meantime.
    lock (lock_load_esn)
    {
        if (!cache.ContainsKey(st))
            cache.Add(st, esn);
    }
}

lock_load_esn is a static member of the class, of type Object.
esn.populate reads from a file, using a separate StreamReader for each thread.
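As an aside, the two lock blocks above can be replaced by ConcurrentDictionary.GetOrAdd, which shipped with .NET 4 (the VS2010 framework). The sketch below uses simplified stand-in types; the real code's Esn/pData types are not reproduced here. Note that, like the original code, the value factory may run more than once for the same key under contention.

```csharp
using System;
using System.Collections.Concurrent;

// Sketch: a lock-free cache via ConcurrentDictionary.GetOrAdd (.NET 4).
// LoadValue stands in for the esn lookup + populate work in the question.
class CacheDemo
{
    static readonly ConcurrentDictionary<string, string> cache =
        new ConcurrentDictionary<string, string>();

    static string LoadValue(string key)
    {
        // stands in for pData.esa_inv_idx lookup + esn.populate(...)
        return "loaded:" + key;
    }

    public static string GetOrLoad(string key)
    {
        // The factory runs only on a cache miss; no explicit lock needed.
        return cache.GetOrAdd(key, LoadValue);
    }

    static void Main()
    {
        Console.WriteLine(GetOrLoad("term0")); // loaded:term0
        Console.WriteLine(cache.Count);        // 1
    }
}
```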

However, when I press the Synchronization button to see what causes the most delay, I see that the profiler reports lines which are function entrance lines, and doesn't report the locked sections themselves.
It doesn't even report the function that contains the above code (reminder - the only lock in the program) as part of the blocking profile with the noise level at 2%. With the noise level at 0% it reports all the functions of the program, and I don't understand why they count as blocking synchronizations.

So my question is - what is going on here?
How can it be that 85% of the time is spent on synchronization?
How do I find out what the problem with the parallel sections of my program really is?

Thanks.

Update: After drilling down into the threads (using the extremely useful visualizer) I found out that most of the synchronization time was spent waiting for the GC thread to complete memory allocations, and that frequent allocations were needed because of resize operations on generic data structures.

I'll have to see how to initialize my data structures so that they allocate enough memory on initialization, possibly avoiding this race for the GC thread.

I'll report the results later today.

Update: It appears memory allocations were indeed the cause of the problem. When I used initial capacities for all the Dictionaries and Lists in the parallel-executed class, the synchronization problem became smaller. I now had only about 80% synchronization time, with spikes of 70% CPU utilization (previous spikes were only about 40%).
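The fix described above amounts to giving every collection its expected size up front, so the backing arrays never resize (each resize is a fresh allocation that feeds the GC). A sketch, with an illustrative capacity:

```csharp
using System;
using System.Collections.Generic;

// Sketch: pre-sizing collections so they allocate once on construction.
class CapacityDemo
{
    public static int PresizedCapacity(int expected)
    {
        // One backing array, allocated once up front.
        var list = new List<int>(expected);
        for (int i = 0; i < expected; i++) list.Add(i);
        return list.Capacity; // unchanged: no resize happened while filling
    }

    static void Main()
    {
        // Without a capacity, List<T> doubles its backing array as it
        // grows (4 -> 8 -> 16 -> ...), allocating at every doubling.
        var dict = new Dictionary<string, int>(100000); // same idea for dictionaries
        Console.WriteLine(PresizedCapacity(100000)); // 100000
    }
}
```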

I drilled even further into each thread and discovered that many calls to GC allocate were now being made for allocating small objects which were not part of the large dictionaries.

I solved this issue by providing each thread with a pool of preallocated such objects, which I use instead of calling the "new" function.

So I essentially implemented a separate memory pool for each thread, but in a very crude way that is time-consuming and actually not very good - I still have to use a lot of new calls to initialize these objects, only now I do it once globally, and there is less contention on the GC thread, even when the pool has to grow.
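A crude per-thread pool of the kind described could look like the sketch below, built on ThreadLocal&lt;T&gt; (also .NET 4). This is an illustrative reconstruction, not the question's actual code: each thread pops from and pushes to its own stack, so Rent/Return need no cross-thread synchronization.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Sketch: a per-thread object pool using ThreadLocal<T> (.NET 4).
class PerThreadPool<T> where T : new()
{
    private readonly ThreadLocal<Stack<T>> pool =
        new ThreadLocal<Stack<T>>(() => new Stack<T>());

    public T Rent()
    {
        var s = pool.Value;                     // this thread's private stack
        return s.Count > 0 ? s.Pop() : new T(); // "new" only on a pool miss
    }

    public void Return(T item)
    {
        pool.Value.Push(item); // caller must reset the object before reuse
    }
}

class PoolDemo
{
    static void Main()
    {
        var pool = new PerThreadPool<List<int>>();
        var buf = pool.Rent();
        buf.Add(42);
        buf.Clear();       // reset before returning
        pool.Return(buf);
        Console.WriteLine(ReferenceEquals(buf, pool.Rent())); // True: reused, not reallocated
    }
}
```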

But this is definitely not a solution I like, as it doesn't generalize easily and I wouldn't like to write my own memory manager.
Is there a way to tell .NET to allocate a predefined amount of memory for each thread, and then take all memory allocations from the local pool?

Can you allocate less?

I've had a couple of similar experiences, looking at bad perf and discovering the heart of the issue was the GC. In each case, though, I discovered that I was accidentally hemorrhaging memory in some inner loop, allocating tons of temporary objects needlessly. I'd give the code a careful look and see if there are allocations you can remove. I think it's rare for programs to 'need' to allocate heavily in inner loops.
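The inner-loop "hemorrhage" described above often looks like the first method in this sketch (the string example is illustrative, not from the question): concatenation with += allocates a fresh string every iteration, while a single buffer hoisted out of the loop allocates roughly once.

```csharp
using System;
using System.Text;

// Sketch: needless per-iteration allocation vs. one reused buffer.
class HoistDemo
{
    public static string Wasteful(string[] words)
    {
        string result = "";
        foreach (var w in words)
            result += w + " "; // new string object each time through the loop
        return result;
    }

    public static string Frugal(string[] words)
    {
        var sb = new StringBuilder(); // one buffer, reused across iterations
        foreach (var w in words)
            sb.Append(w).Append(' ');
        return sb.ToString();
    }

    static void Main()
    {
        var words = new[] { "a", "b", "c" };
        Console.WriteLine(Frugal(words) == Wasteful(words)); // True: same result, far fewer allocations
    }
}
```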
