
OpenMP is slower than serial for matrix initialization

I'm just learning how to use OpenMP, but the code below runs slower than the serial version. Basically, I'm just trying to initialize a huge two-dimensional matrix.

219     int **scoreMatrix = malloc(sizeof(int *) * (strlen(seq1)+1));
220 
221     int i,j = 0;
222     omp_set_num_threads(6);
224 #pragma omp parallel private(i,j) 
225 {
226     int std = omp_get_thread_num();
227     //Initialize matrix
228     for(i = std; i < strlen(seq1)+1; i=i+nthreads){
229         scoreMatrix[i] = malloc(sizeof(int) * (strlen(seq2)+1));
230         for(j = 0; j < strlen(seq2)+1; j++){
231             scoreMatrix[i][j] = 0;
232         }
233     }
234 }

Please tell me if I am missing any important syntax or concept in OpenMP. Thank you!

While it has been some time since I last worked with OpenMP, your problem most likely comes down to overheads, with the work done by each thread being rather small. You have each thread set up to do 1/6 of the mallocs and 1/6 of the zero-initializations. For a problem like this you should consider just how large seq1 and seq2 are and how much work is actually being executed in parallel.

Memory allocation through the standard malloc, for example, is likely a point of contention; see for instance this question for a more detailed analysis. If the bulk of the work is being done inside malloc, and is therefore not running in parallel to any large extent, then you won't get much of a speedup in exchange for paying the overhead of thread creation. If parallel allocation is truly needed, you may see improvements from using a different allocator.

Setting regions of memory to 0 can be split up amongst the threads, but it is almost certainly extremely fast in comparison to the allocation. There may also be some cache-coherency cost to writing scoreMatrix[i] on line 229, since that cache line is shared amongst the threads.
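One way to take allocator contention out of the picture entirely is a single contiguous allocation with row pointers into it, so malloc is called twice in total and the threads only share the zeroing work. A minimal sketch, where the function name and the n/m parameters are my own placeholders standing in for strlen(seq1)+1 and strlen(seq2)+1:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: one malloc for the row-pointer array and one for the whole
 * data block, so threads never contend on the allocator. */
int **alloc_score_matrix(size_t n, size_t m) {
    int **rows = malloc(n * sizeof(int *));
    int *block = malloc(n * m * sizeof(int));
    if (!rows || !block) { free(rows); free(block); return NULL; }

    for (size_t i = 0; i < n; i++)
        rows[i] = block + i * m;   /* row i points into the block */

    /* Zeroing is the only work left to parallelize; each thread
     * touches a disjoint, contiguous range of rows. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        memset(rows[i], 0, m * sizeof(int));

    return rows;
}
```

Freeing is then free(rows[0]) followed by free(rows). Whether the parallel memset pays off still depends on the sizes involved, for the reasons above.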

With OpenMP and MPI it is important to remember that there are overheads involved simply in starting the parallel parts of a computation, so blocks without much work, even if they are highly parallel, may not be worth parallelizing. When you get to doing computations on the array you are much more likely to see a benefit.

For zeroing memory in general, your best easy bet is likely memset, though your compiler may well optimize lines 230 & 231 into something similar.
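Along the same lines, calloc hands back memory that is already zero-filled (often cheaply, via zeroed pages from the OS), so the inner j-loop can be dropped entirely. A small sketch keeping the asker's per-row layout; the function name and size parameters are my own, and per-row error checking is elided for brevity:

```c
#include <stdlib.h>

/* Sketch: per-row calloc instead of malloc plus an explicit zero loop.
 * nrows/ncols stand in for strlen(seq1)+1 and strlen(seq2)+1. */
int **alloc_rows_zeroed(size_t nrows, size_t ncols) {
    int **rows = malloc(nrows * sizeof(int *));
    if (!rows) return NULL;
    for (size_t i = 0; i < nrows; i++)
        rows[i] = calloc(ncols, sizeof(int)); /* already all-zero */
    return rows;
}
```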

You would be better off letting OpenMP do the parallelisation for you with a #pragma omp parallel for:

int **scoreMatrix = malloc(sizeof(int *) * (strlen(seq1)+1));

int i,j = 0;
omp_set_num_threads(6);

#pragma omp parallel for private(i,j) 
for(i = 0; i < strlen(seq1)+1; ++i){
    scoreMatrix[i] = malloc(sizeof(int) * (strlen(seq2)+1));
    for(j = 0; j < strlen(seq2)+1; ++j){
        scoreMatrix[i][j] = 0;
    }
}

This may have an effect depending on how well OpenMP deals with thread occupancy for this loop.
