
How to optimize an n-queens OpenMP parallel program?

I'm working on parallelizing the n-queens problem using OpenMP, but my sequential program is just as fast. I've been trying to use num_threads, but I don't think I am doing it correctly.

Can someone look at my code and tell me what I am doing wrong or give me some pointers? Thank you.

Here is my parallel program:

// Parallel version of the N-Queens problem.


#include <iostream>
#include <cstdlib>   // abs()
#include <omp.h>
#include <time.h>
#include <sys/time.h>

// Timing execution
double startTime, endTime;

// Number of solutions found
int numofSol = 0;

// Board size and number of queens
#define N 11

void placeQ(int queens[], int row, int column) {
    
    for(int i = 0; i < row; i++) {
        // Vertical
        if (queens[i] == column) {
            return;
        }
        
        // Two queens in the same diagonal
        if (abs(queens[i] - column) == (row - i)) {
            return;
        }
    }
    
    // Set the queen
    queens[row] = column;
    
    if(row == N-1) {
        
        #pragma omp atomic 
            numofSol++;  //Placed the final queen, found a solution
        
        #pragma omp critical
        {
            std::cout << "The number of solutions found is: " << numofSol << std::endl; 
            for (int row = 0; row < N; row++) {
                for (int column = 0; column < N; column++) {
                    if (queens[row] == column) {
                        std::cout << "X";
                    }
                    else {
                        std::cout << "|";
                    }
                }
                std::cout << "\n" << std::endl; 
            }
        }
    }
    
    else {
        
        // Increment row
        for(int i = 0; i < N; i++) {
            placeQ(queens, row + 1, i);
        }
    }
} // End of placeQ()

void solve() {
    #pragma omp parallel num_threads(30)
    #pragma omp single
    {
        for(int i = 0; i < N; i++) {
            // New task added for first row and each column recursion.
            #pragma omp task
            { 
                placeQ(new int[N], 0, i);
            }
        }
    }
} // end of solve()

int main(int argc, char*argv[]) {

    startTime = omp_get_wtime();   
    solve();
    endTime = omp_get_wtime();
  
    // Print board size, number of solutions, and execution time. 
    std::cout << "Board Size: " << N << std::endl; 
    std::cout << "Number of solutions: " << numofSol << std::endl; 
    std::cout << "Execution time: " << endTime - startTime << " seconds." << std::endl; 
    
    return 0;
}

More than 95% of your program's execution time is spent printing strings to the console, and this operation sits inside a critical section so that only one thread can do it at a time. The overhead of the IO operations and of the critical section grows with the number of threads used. Consequently, the sequential execution time is better than the parallel one.

Actually, to be more precise, it is not the printing itself that is slow, but the synchronization with the console caused by std::endl (which implicitly performs a std::flush), together with the string formatting. To fix that, you can prepare a thread-local string in parallel (std::ostringstream works well for that). Each local string can then be appended to one big global string, and its content can be printed sequentially in the main thread (to avoid any additional overhead from parallel IO) and outside the timed section.

Besides this, you have 11 tasks, yet your code creates 30 threads for them, while you probably have fewer than 30 cores (or even 30 hardware threads). Creating too many threads is costly (mainly due to thread preemption/scheduling). Moreover, hard-coding the number of threads in the program is generally bad practice. Use the portable environment variable OMP_NUM_THREADS instead, for example as in the small check below.
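As a quick way to verify which thread count is actually in effect, here is a small illustrative snippet (not part of the original program); omp_get_max_threads() reports the team size of the next parallel region and honors OMP_NUM_THREADS:

// Illustrative check: verify the thread count picked up from OMP_NUM_THREADS.
// Build: g++ -fopenmp check_threads.cpp ; run e.g.: OMP_NUM_THREADS=6 ./a.out
#include <cstdio>
#include <omp.h>

int main() {
    // Team size of the next parallel region (respects OMP_NUM_THREADS).
    std::printf("Threads OpenMP will use: %d\n", omp_get_max_threads());
    // Number of processors available to the program.
    std::printf("Processors available:    %d\n", omp_get_num_procs());
    return 0;
}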

Here is the code taking the above remarks into account:

// Parallel version of the N-Queens problem.

#include <iostream>
#include <cstdlib>   // abs()
#include <omp.h>
#include <time.h>
#include <sys/time.h>
#include <sstream>

// Timing execution
double startTime, endTime;

// Number of solutions found
int numofSol = 0;

std::ostringstream globalOss;

// Board size and number of queens
#define N 11

void placeQ(int queens[], int row, int column) {
    
    for(int i = 0; i < row; i++) {
        // Vertical
        if (queens[i] == column) {
            return;
        }
        
        // Two queens in the same diagonal
        if (abs(queens[i] - column) == (row - i)) {
            return;
        }
    }
    
    // Set the queen
    queens[row] = column;
    
    if(row == N-1) {
        
        #pragma omp atomic 
            numofSol++;  //Placed the final queen, found a solution
        
        std::ostringstream oss;
        oss << "The number of solutions found is: " << numofSol << std::endl; 
        for (int row = 0; row < N; row++) {
            for (int column = 0; column < N; column++) {
                if (queens[row] == column) {
                    oss << "X";
                }
                else {
                    oss << "|";
                }
            }
            oss << std::endl << std::endl; 
        }

        #pragma omp critical
        globalOss << oss.str();
    }
    
    else {
        
        // Increment row
        for(int i = 0; i < N; i++) {
            placeQ(queens, row + 1, i);
        }
    }
} // End of placeQ()

void solve() {
    #pragma omp parallel //num_threads(30)
    #pragma omp single
    {
        for(int i = 0; i < N; i++) {
            // New task added for first row and each column recursion.
            #pragma omp task
            { 
                placeQ(new int[N], 0, i);
            }
        }
    }
} // end of solve()

int main(int argc, char*argv[]) {

    startTime = omp_get_wtime();   
    solve();
    endTime = omp_get_wtime();

    std::cout << globalOss.str();
  
    // Print board size, number of solutions, and execution time. 
    std::cout << "Board Size: " << N << std::endl; 
    std::cout << "Number of solutions: " << numofSol << std::endl; 
    std::cout << "Execution time: " << endTime - startTime << " seconds." << std::endl; 
    
    return 0;
}

Here are the resulting execution times on my machine:

Time of the reference implementation (30 threads): 0.114309 s

Optimized implementation:
1 thread:  0.018634 s (x1.00)
2 threads: 0.009978 s (x1.87)
3 threads: 0.006840 s (x2.72)
4 threads: 0.005766 s (x3.23)
5 threads: 0.004941 s (x3.77)
6 threads: 0.003963 s (x4.70)

If you want even faster parallel code, you can:

  • provide a few more tasks to OpenMP (to improve the work load balancing), but not too many (because of the per-task overhead);
  • reduce the number of (implicit) allocations;
  • perform a thread-local reduction on numofSol and use just one atomic update per task (see the sketch after this list).
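Below is a minimal sketch (my own illustration, not the answer's code) of the first and last points: it creates one task per pair of first-row/second-row columns (N*N tasks instead of N) and counts solutions in a task-local variable, so the shared counter is updated with a single atomic operation per task. The helper countFrom and the overall structure are assumptions for the sketch; it only counts solutions and skips the board printing.

// Sketch: more tasks + one atomic update per task (counting only, no printing).
#include <cstdlib>   // abs()
#include <iostream>
#include <omp.h>

#define N 11
int numofSol = 0;

// Hypothetical helper: same pruning as placeQ, but accumulates into a local counter.
void countFrom(int queens[], int row, int &localCount) {
    if (row == N) { localCount++; return; }
    for (int col = 0; col < N; col++) {
        bool ok = true;
        for (int i = 0; i < row; i++)
            if (queens[i] == col || abs(queens[i] - col) == row - i) { ok = false; break; }
        if (ok) {
            queens[row] = col;
            countFrom(queens, row + 1, localCount);
        }
    }
}

void solve() {
    #pragma omp parallel
    #pragma omp single
    for (int c0 = 0; c0 < N; c0++)
        for (int c1 = 0; c1 < N; c1++) {
            // One task per (row 0, row 1) column pair: N*N tasks improve load balancing.
            #pragma omp task firstprivate(c0, c1)
            {
                int queens[N];       // stack array: no heap allocation per task
                int localCount = 0;
                queens[0] = c0;
                // Discard invalid second-row placements up front (same column or adjacent diagonal).
                if (c1 != c0 && abs(c1 - c0) != 1) {
                    queens[1] = c1;
                    countFrom(queens, 2, localCount);
                }
                // Single atomic update per task instead of one per solution.
                #pragma omp atomic
                numofSol += localCount;
            }
        }
}

int main() {
    double t0 = omp_get_wtime();
    solve();
    double t1 = omp_get_wtime();
    std::cout << "Solutions: " << numofSol << " in " << (t1 - t0) << " s" << std::endl;
    return 0;
}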
