OpenMP multithreading in C++

This follows on from one of my previous posts, "efficient approach in using c++ thread/ boost thread".

I know that I can convert a serial program into a parallel one with OpenMP by changing:

int thread_count=8;
for(int i=1;i<100000;i++)
{
    do_work(i,record);
}

into

int thread_count=8;
#pragma omp parallel for
for(int i=1;i<100000;i++)
{
    do_work(i,record);
}

How about fully parallelizing a nested for loop? Is it done by changing

int thread_count=8;
for(int i=1;i<100000;i++)
{
    for(int j=1;j<100000;j++)
    {
        do_work(i,j,record);
    }
}

into

int thread_count=8;
#pragma omp parallel for
for(int i=1;i<100000;i++)
{
    #pragma omp parallel for
    for(int j=1;j<100000;j++)
    {
        do_work(i,j,record);
    }
}

for maximal parallelization? Thank you.

It usually isn't a good idea to do that. It implies nested parallelism, with thread pools being created (or, more likely, only managed) at each iteration of the outermost loop.

However, if parallelising only the outermost loop really isn't sufficient for you (and it should be in most cases), you can always consider using a collapse(2) clause to fuse the i and j loops, so that the whole (i,j) domain is dealt with in parallel.
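A minimal sketch of the collapse(2) approach follows. The do_work body, the flat record vector, and the run_collapsed helper are assumptions made for illustration (the question does not show them); the only requirement is that each (i,j) pair writes to its own slot, so no locking is needed. If the code is compiled without OpenMP support, the pragma is ignored and the loops simply run serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's do_work(i, j, record):
// every (i, j) pair writes only to its own element of record.
void do_work(int i, int j, std::vector<long long>& record, int cols) {
    record[static_cast<std::size_t>(i) * cols + j] =
        static_cast<long long>(i) * j;
}

// Runs do_work over the whole rows x cols domain and returns the result.
std::vector<long long> run_collapsed(int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    std::vector<long long> record(static_cast<std::size_t>(rows) * cols, 0);

    // collapse(2) fuses the i and j loops into a single iteration space
    // of rows * cols iterations, which is then divided among the threads.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            do_work(i, j, record, cols);
        }
    }
    return record;
}
```

Calling run_collapsed(4, 5), for example, fills a 20-element vector in which the element at index i * 5 + j holds i * j. Note that collapse requires the two loops to be perfectly nested, with no statements between the two for headers.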

The last solution, for special needs: if you really need nested parallelism without the overhead of nested OpenMP parallel regions, create a single parallel region and manually assign work to your threads based on their id. This isn't as simple as just adding compiler directives, but it isn't particularly complicated either. Still, you should only consider it if and when you have very specific needs that you cannot address in a satisfactory manner with the usual OpenMP constructs and philosophy.
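One possible reading of the "single parallel region with manual assignment" idea is sketched below. It is not necessarily the partitioning the answer had in mind: here the (i,j) domain is flattened into one index range and each thread takes a contiguous slice based on its id. The run_manual name and the i * j payload are hypothetical placeholders for the real work. The _OPENMP guard lets the code fall back to a single "thread" when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Fills a rows x cols record using one parallel region and a manual,
// id-based split of the flattened (i, j) index range.
std::vector<long long> run_manual(int rows, int cols, int thread_count) {
    assert(rows >= 0 && cols >= 0 && thread_count > 0);
    const long long total = static_cast<long long>(rows) * cols;
    std::vector<long long> record(static_cast<std::size_t>(total), 0);

    #pragma omp parallel num_threads(thread_count)
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();       // this thread's id
        const int nthreads = omp_get_num_threads(); // actual team size
#else
        const int tid = 0;                          // serial fallback
        const int nthreads = 1;
#endif
        // Contiguous slice [begin, end) of the flattened domain.
        const long long begin = total * tid / nthreads;
        const long long end = total * (tid + 1) / nthreads;
        for (long long k = begin; k < end; k++) {
            const int i = static_cast<int>(k / cols);
            const int j = static_cast<int>(k % cols);
            // Placeholder for do_work(i, j, record).
            record[static_cast<std::size_t>(k)] =
                static_cast<long long>(i) * j;
        }
    }
    return record;
}
```

The slice bounds are computed from the team size reported inside the region rather than from thread_count, because the runtime is allowed to provide fewer threads than requested.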

First of all, to use a specific thread count you should use:

int thread_count=8;
#pragma omp parallel for num_threads(thread_count)
for(int i=1;i<100000;i++)
{
    do_work(i,record);
}

And if you want nested parallelism, you need to turn it on with omp_set_nested(1).

If each thread is doing a similar job, then to achieve maximum parallel performance you should make sure that the total number of threads corresponds to the number of cores (or virtual processors, in the case of hyper-threading), so use omp_get_max_threads() to check it. And if you use nested parallelization, the total number of threads is the product of the thread counts at each level, so you can easily produce more threads than your virtual processors can effectively support.
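To make the "product of thread counts" point concrete, here is a small sketch that counts how many threads actually execute the innermost region. The count_nested_threads helper is hypothetical. It enables nesting with omp_set_max_active_levels(2) rather than omp_set_nested(1), since omp_set_nested is deprecated as of OpenMP 5.0; on older runtimes omp_set_nested(1) plays the same role.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns how many threads ran the innermost region when `outer` threads
// each open a nested region that requests `inner` threads.
int count_nested_threads(int outer, int inner) {
    assert(outer > 0 && inner > 0);
#ifdef _OPENMP
    // Allow two levels of active parallel regions.
    omp_set_max_active_levels(2);
#endif
    int total = 0;
    #pragma omp parallel num_threads(outer)
    {
        #pragma omp parallel num_threads(inner)
        {
            // Each inner thread registers itself exactly once.
            #pragma omp atomic
            total++;
        }
    }
    return total;
}
```

With nesting active, count_nested_threads(2, 3) can report up to 2 * 3 = 6 threads. The runtime may grant fewer than requested, and a build without OpenMP reports 1, so the product is an upper bound rather than a guarantee.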

The way you suggested will not give you a performance increase, since every thread will still be executing a single do_work(...). However, if a single do_work() call is long enough and itself contains some loops, you might get some speed boost by applying the second level of parallel processing inside it. That way, your threads run tasks of different lengths, and the scheduler may squeeze in some short tasks if resources are available at a given moment.

But for this I would not recommend nested OMP: in my experiments, applying a second level of #pragma omp for actually degraded the speed. Yet you might still get some improvement if you use different multithreading mechanisms, for example OMP for the outer parallelization and a boost thread pool or WinApi _beginthreadex(...) for the inner loops.
