
How to parallelize this array the correct way using OpenMP?

After I tried to parallelize the code with OpenMP, the elements in the array are wrong; the order of the elements is not very important to me. Or would it be more convenient to use a C++ std::vector instead of an array for the parallelization? Could you suggest an easy way?

#include <stdio.h>
#include <math.h>

int main()
{
    int n = 100;
    int a[n*(n+1)/2]={0};
    int count=0;

    #pragma omp parallel for reduction(+:a,count)
    for (int i = 1; i <= n; i++) {
        for (int j = i + 1; j <= n; j++) {
            double k = sqrt(i * i + j * j);
            if (fabs(round(k) - k) < 1e-10) {
                a[count++] = i;
                a[count++] = j;
                a[count++] = (int) k;
            }
        }
    }
    
    for(int i=0;i<count;i++)
        printf("%d %s",a[i],(i+1)%3?"":", ");
    printf("\ncount: %d", count);
    return 0;
}

Original output:

3 4 5, 5 12 13, 6 8 10, 7 24 25, 8 15 17, 9 12 15, 9 40 41, 10 24 26, 11 60 61, 12 16 20, 12 35 37, 13 84 85, 14 48 50, 15 20 25, 15 36 39, 16 30 34, 16 63 65, 18 24 30, 18 80 82, 20 21 29, 20 48 52, 20 99 101, 21 28 35, 21 72 75, 24 32 40, 24 45 51, 24 70 74, 25 60 65, 27 36 45, 28 45 53, 28 96 100, 30 40 50, 30 72 78, 32 60 68, 33 44 55, 33 56 65, 35 84 91, 36 48 60, 36 77 85, 39 52 65, 39 80 89, 40 42 58, 40 75 85, 40 96 104, 42 56 70, 45 60 75, 48 55 73, 48 64 80, 48 90 102, 51 68 85, 54 72 90, 56 90 106, 57 76 95, 60 63 87, 60 80 100, 60 91 109, 63 84 105, 65 72 97, 66 88 110, 69 92 115, 72 96 120, 75 100 125, 80 84 116,
count: 189

After using OpenMP (gcc file.c -fopenmp):

411 538 679, 344 609 711, 354 533 649, 218 387 449, 225 475 534, 182 283 339, 81 161 182, 74 190 204, 77 138 159, 79 176 195, 18 24 30, 18 80 82, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0, 0 0 0,
count: 189

Your threads are all accessing the shared count.

You would be better off eliminating count and having each loop iteration determine where to write its output based only on the (per-thread) values of i and j.
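For example, here is a rough sketch of that idea (the slots buffer and the triangle-index formula are just one possible layout, not the only way to do it): each (i, j) pair with i < j maps to its own fixed slot, so there is no shared counter and the threads never write to the same location.

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    int const n = 100;
    long const pairs = static_cast<long>(n) * (n - 1) / 2;
    std::vector<int> slots(3 * pairs, 0);   // 0 means "no triple stored here"

    #pragma omp parallel for schedule(dynamic)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10) {
                // position of (i, j) in row-major upper-triangle order
                long const idx = static_cast<long>(i - 1) * n
                               - static_cast<long>(i) * (i - 1) / 2
                               + (j - i - 1);
                slots[3 * idx]     = i;
                slots[3 * idx + 1] = j;
                slots[3 * idx + 2] = static_cast<int>(std::round(k));
            }
        }
    }

    // sequential pass: print and count only the filled slots
    int count = 0;
    for (long p = 0; p < pairs; ++p) {
        if (slots[3 * p] != 0) {
            std::printf("%d %d %d, ", slots[3 * p], slots[3 * p + 1], slots[3 * p + 2]);
            count += 3;
        }
    }
    std::printf("\ncount: %d\n", count);
}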

Alternatively, use a vector to accumulate the results:

#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

#pragma omp declare                                                        \
    reduction(vec_append : std::vector<std::pair<int,int>> :               \
              omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

int main()
{
    constexpr int n = 100'000;
    std::vector<std::pair<int,int>> result;

#pragma omp parallel for \
            reduction(vec_append:result) \
            schedule(dynamic)
    for (int i = 1;  i <= n;  ++i) {
        for (int j = i + 1;  j <= n;  ++j) {
            // use 64-bit arithmetic: i*i + j*j would overflow int for n this large
            auto const h2 = 1LL * i * i + 1LL * j * j; // hypotenuse squared
            long long const h = std::sqrt(h2) + 0.5;   // integer square root
            if (h * h == h2) {
                result.emplace_back(i, j);
            }
        }
    }

    // for (auto const& v: result) {
    //     std::cout << v.first << ' '
    //               << v.second << ' '
    //               << std::hypot(v.first, v.second) << ' ';
    // }
    std::cout << "\ncount: " << result.size() << '\n';
}

The count variable is an index into a. The reduction(+:a,count) clause sums the array; it is not a concatenation operation, which is what I think you are looking for.

The count variable needs to be protected by a mutex, something like #pragma omp critical, but I am not an OpenMP expert.

Alternatively, create int a[n][n], set all of its elements to -1 (a sentinel value to indicate "invalid"), and then store the result of sqrt() when it is close enough to a whole number.
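A rough sketch of that sentinel approach, using a flat std::vector as the n-by-n table (the layout is just an assumption):

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    int const n = 100;
    // flat n*n table; -1 marks "no integer hypotenuse for this (i, j)"
    std::vector<int> a(static_cast<long>(n) * n, -1);

    #pragma omp parallel for schedule(dynamic)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10) {
                // each (i, j) owns its own cell, so no locking is needed
                a[static_cast<long>(i - 1) * n + (j - 1)] = static_cast<int>(std::round(k));
            }
        }
    }

    // sequential pass over the table to print and count the triples
    int count = 0;
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            int const h = a[static_cast<long>(i - 1) * n + (j - 1)];
            if (h != -1) {
                std::printf("%d %d %d, ", i, j, h);
                count += 3;
            }
        }
    }
    std::printf("\ncount: %d\n", count);
}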

As an alternative to using a critical section, this solution uses atomics and could therefore be faster.

The following code might freeze your computer due to memory consumption. Be careful!

#include <cstdio>
#include <cmath>

#include <vector>

int main() {
    int const n = 100;
    // without a better (smaller) upper_bound this is extremely
    // wasteful in terms of memory for big n 
    long const upper_bound = 3L * static_cast<long>(n) *
                             (static_cast<long>(n) - 1L) / 2L;
    std::vector<int> a(upper_bound, 0);
    int count = 0;

    #pragma omp parallel for schedule(dynamic) shared(a, count)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));

            if (std::fabs(std::round(k) - k) < 1e-10) {
                int my_pos;
                #pragma omp atomic capture
                my_pos = count++;

                a[3 * my_pos] = i;
                a[3 * my_pos + 1] = j;
                a[3 * my_pos + 2] = static_cast<int>(std::round(k));
            }
        }
    }
    count *= 3;
    
    for(int i = 0; i < count; ++i) {
        std::printf("%d %s", a[i], (i + 1) % 3 ? "" : ", ");
    }
    printf("\ncount: %d", count);

    return 0;
}

EDIT: My answer was initially a reaction to a now-deleted answer that used a critical section in a sub-optimal way. In the following I present another solution, which combines a critical section with std::vector::emplace_back() to avoid the need for upper_bound, similar to Toby Speight's solution. In general, a reduction clause like the one in Toby Speight's solution should be preferred over critical sections and atomics, as reductions should scale better to large numbers of threads. In this particular case (relatively few results are written to a) and without a large number of cores to run on, the following code might still be preferable.

#include <cstdio>
#include <cmath>

#include <tuple>
#include <vector>

int main() {
    int const n = 100;

    std::vector<std::tuple<int, int, int>> a{};
    
    // optional, might reduce number of reallocations
    a.reserve(2 * n); // 2 * n is an arbitrary choice

    #pragma omp parallel for schedule(dynamic) shared(a)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));

            if (std::fabs(std::round(k) - k) < 1e-10) {
                #pragma omp critical
                a.emplace_back(i, j, static_cast<int>(std::round(k)));
            }
        }
    }
    long const count = 3L * static_cast<long>(a.size());
    
    for(unsigned long i = 0UL; i < a.size(); ++i) {
        std::printf("%d %d %d\n",
                    std::get<0>(a[i]), std::get<1>(a[i]), std::get<2>(a[i]));
    }
    printf("\ncount: %ld", count);

    return 0;
}
