
How can I parallelize DFS using OpenMP?

I am trying to figure out how to parallelize a depth-first traversal with OpenMP.

This is the algorithm:

    void dfs(int v) {
        visited[v] = true;
        for (int i = 0; i < g[v].size(); ++i) {
            if (!visited[g[v][i]]) {
                dfs(g[v][i]);
            }
        }
    }

This is what I tried:

    #include <iostream>
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>
    #include <fstream>
    #include <vector>
    using namespace std;

    vector<int> output;
    vector<bool> visited;
    vector< vector<int> > g;
    int global = 0;

    void dfs(int v)
    {
        printf(" thread %i", omp_get_thread_num());
        visited[v] = true;
        output.push_back(v);
        for (int i = 0; i < g[v].size(); ++i) {
            if (!visited[g[v][i]]) {
                #pragma omp task shared(visited)
                {
                    #pragma omp critical
                    {
                        dfs(g[v][i]);
                    }
                }
            }
        }
    }
    
    int main() {
        omp_set_num_threads(5);
        int length = 1000;
        int e = 4;
        for (int i = 0; i < length; i++) {
            visited.push_back(false);
        }

        // build a random graph with e distinct out-edges per vertex
        g.resize(length);
        for (int x = 0; x < g.size(); x++) {
            int p = 0;
            while (p < e) {
                int new_e = rand() % length;
                if (new_e != x) {
                    bool check = false;
                    for (int c = 0; c < g[x].size(); c++) {
                        if (g[x][c] == new_e) {
                            check = true;
                        }
                    }
                    if (check == false) {
                        g[x].push_back(new_e);
                        p++;
                    }
                }
            }
        }

        ofstream fin("input.txt");
        for (int i = 0; i < g.size(); i++) {
            for (int j = 0; j < g[i].size(); j++) {
                fin << g[i][j] << " ";
            }
            fin << endl;
        }
        fin.close();

        double start = omp_get_wtime();
        #pragma omp parallel
        {
            #pragma omp single
            {
                dfs(0);
            }
        }
        double end = omp_get_wtime();

        cout << endl;
        printf("Work took %f seconds\n", end - start);
        cout << global;

        ofstream fout("output.txt");
        for (int i = 0; i < output.size(); i++) {
            fout << output[i] << " ";
        }
        fout.close();
    }

The graph g is generated and written to the file input.txt. The traversal order produced by the program is written to the file output.txt.

But this does not scale on any number of threads and is much slower than the sequential version. I tried using taskwait, but in that case only one thread does any work.

A critical section protects a block of code so that no more than one thread can execute it at any given time. Having the recursive call to dfs() inside a critical section means that no two tasks could make that call simultaneously. Moreover, since dfs() is recursive, any top-level task will have to wait for the entire recursion to finish before it could exit the critical section and allow a task in another thread to execute.

You need to synchronise where it will not interfere with the recursive call, and only protect updates to shared data that does not provide its own internal synchronisation. This is the original code:

void dfs(int v){
    visited[v] = true;
    for (int i = 0; i < g[v].size(); ++i) {
        if (!visited[g[v][i]]) {
            dfs(g[v][i]);
        }
    }
}

A naive but still parallel version would be:

void dfs(int v){
    #pragma omp critical
    {
        visited[v] = true;
        for (int i = 0; i < g[v].size(); ++i) {
            if (!visited[g[v][i]]) {
                #pragma omp task
                dfs(g[v][i]);
            }
        }
    }
}

Here, the code leaves the critical section as soon as the tasks are created. The problem is that the entire body of dfs() is one critical section, which means that even if there are 1000 recursive calls pending in parallel, they will execute one after another and not concurrently. It will even be slower than the sequential version because of the constant cache invalidation and the added OpenMP overhead.

One important note is that OpenMP critical sections, just like regular OpenMP locks, are not re-entrant, so a thread can easily deadlock itself by encountering the same critical section in a recursive call from inside that same critical section, e.g., if a task gets executed immediately instead of being postponed. It is therefore better to implement a re-entrant critical section using OpenMP nested locks.

The reason for that code being slower than sequential is that it does nothing except traverse the graph. If it was doing some additional work at each node, e.g., accessing data or computing node-local properties, then this work could be inserted between the update of visited and the loop over the unvisited neighbours:

void dfs(int v){
    #pragma omp critical
    visited[v] = true;

    // DO SOME WORK

    #pragma omp critical
    {
        for (int i = 0; i < g[v].size(); ++i) {
            if (!visited[g[v][i]]) {
                #pragma omp task
                dfs(g[v][i]);
            }
        }
    }
}

The parts in the critical sections will still execute sequentially, but the processing represented by // DO SOME WORK will overlap in parallel.

There are tricks to speed things up by reducing the lock contention introduced by having one big lock / critical section. One could, e.g., use a set of OpenMP locks and map the index of visited onto those locks using simple modulo arithmetic, as described here. It is also possible to stop creating tasks at a certain level of recursion and call a sequential version of dfs() instead.

void p_dfs(int v)
{
    #pragma omp critical
    visited[v] = true;

    #pragma omp parallel for
    for (int i = 0; i < graph[v].size(); ++i)
    {
        #pragma omp critical
        if (!visited[graph[v][i]])
        {
            #pragma omp task
            p_dfs(graph[v][i]);
        }
    }
}

OpenMP is good for data-parallel code, where the amount of work is known in advance. It doesn't work well for graph algorithms like this one.

If the only thing you do is what's in your code (pushing elements into a vector), parallelism is going to make it slower. Even if your graph occupies many gigabytes, the bottleneck is memory bandwidth, not compute, so multiple CPU cores won't help. Also, if all threads push results into the same vector, you'll need synchronization. Finally, reading memory recently written by another CPU core is expensive on modern processors, even more so than a cache miss.

If you have some substantial CPU-bound work besides just copying integers, look for alternatives to OpenMP. On Windows, I usually use the CreateThreadpoolWork and SubmitThreadpoolWork APIs. On iOS and OS X, see Grand Central Dispatch. On Linux, see cp_thread_pool_create(3), but unlike the other two I don't have any hands-on experience with it; I just found the docs.

Regardless of the thread pool implementation you use, you'll then be able to post work to the pool dynamically as you traverse the graph. OpenMP also has a thread pool under the hood, but its API is not flexible enough for this kind of dynamic parallelism.
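As a rough illustration of posting traversal work dynamically, here is a sketch using plain C++ std::async and std::atomic rather than any of the pools above (all names are mine; each claimed neighbour becomes a new asynchronous task):

```cpp
#include <atomic>
#include <future>
#include <vector>

using Graph = std::vector<std::vector<int>>;

// Each caller atomically claims a neighbour before recursing, so every node
// is processed exactly once; child traversals run as dynamically spawned tasks.
static void dfs_node(const Graph& g,
                     std::vector<std::atomic<bool>>& visited, int v) {
    std::vector<std::future<void>> children;
    for (int u : g[v]) {
        bool expected = false;
        // exactly one thread flips visited[u] from false to true
        if (visited[u].compare_exchange_strong(expected, true))
            children.push_back(std::async(std::launch::async, dfs_node,
                                          std::cref(g), std::ref(visited), u));
    }
    for (auto& f : children)
        f.wait();                     // join the work spawned at this level
}

std::vector<bool> parallel_dfs(const Graph& g, int start) {
    std::vector<std::atomic<bool>> visited(g.size());
    for (auto& b : visited)
        b.store(false);
    visited[start].store(true);
    dfs_node(g, visited, start);
    std::vector<bool> out(g.size());
    for (std::size_t i = 0; i < g.size(); ++i)
        out[i] = visited[i].load();
    return out;
}
```

std::async spawns a fresh thread per task, so a real implementation would submit to a fixed-size pool instead; the claim-then-spawn structure stays the same.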
