
C++ Sharing Large Arrays and Data Structures Between MPI Processes

I have a program that generates large arrays and matrices that can be upwards of 10 GB in size. The program uses MPI to parallelize workloads, but it is limited by the fact that each process needs its own copy of the array or matrix to perform its portion of the computation. The memory requirements make this approach infeasible with a large number of MPI processes, so I have been looking into Boost::Interprocess as a means of sharing data between MPI processes.

So far, I have come up with the following which creates a large vector and parallelizes the summation of its elements:

#include <ctime>
#include <iostream>
#include <string>
#include <unistd.h>  // for sleep()

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <mpi.h>

typedef boost::interprocess::allocator<double, boost::interprocess::managed_shared_memory::segment_manager> ShmemAllocator;
typedef boost::interprocess::vector<double, ShmemAllocator> MyVector;

const std::size_t vector_size = 1000000000;
const std::string shared_memory_name = "vector_shared_test.cpp";

int main(int argc, char **argv) {
    int numprocs, rank;

    MPI::Init();
    numprocs = MPI::COMM_WORLD.Get_size();
    rank = MPI::COMM_WORLD.Get_rank();

    if(numprocs >= 2) {
        if(rank == 0) {
            std::cout << "On process rank " << rank << "." << std::endl;
            std::time_t creation_start = std::time(NULL);

            boost::interprocess::shared_memory_object::remove(shared_memory_name.c_str());
            boost::interprocess::managed_shared_memory segment(boost::interprocess::create_only, shared_memory_name.c_str(), size_t(12000000000));

            std::cout << "Size of double: " << sizeof(double) << std::endl;
            std::cout << "Allocated shared memory: " << segment.get_size() << std::endl;

            const ShmemAllocator alloc_inst(segment.get_segment_manager());

            MyVector *myvector = segment.construct<MyVector>("MyVector")(alloc_inst);

            std::cout << "myvector max size: " << myvector->max_size() << std::endl;

            myvector->reserve(vector_size);  // avoid repeated reallocation inside the segment
            for(std::size_t i = 0; i < vector_size; i++) {
                myvector->push_back(double(i));
            }

            std::cout << "Vector capacity: " << myvector->capacity() << " | Memory Free: " << segment.get_free_memory() << std::endl;

            std::cout << "Vector creation successful and took " << std::difftime(std::time(NULL), creation_start) << " seconds." << std::endl;
        }

        std::flush(std::cout);
        MPI::COMM_WORLD.Barrier();

        std::time_t summing_start = std::time(NULL);

        std::cout << "On process rank " << rank << "." << std::endl;
        boost::interprocess::managed_shared_memory segment(boost::interprocess::open_only, shared_memory_name.c_str());

        MyVector *myvector = segment.find<MyVector>("MyVector").first;
        double result = 0;

        for(std::size_t i = rank; i < myvector->size(); i += numprocs) {
            result = result + (*myvector)[i];
        }
        double total = 0;
        MPI::COMM_WORLD.Reduce(&result, &total, 1, MPI::DOUBLE, MPI::SUM, 0);

        std::flush(std::cout);
        MPI::COMM_WORLD.Barrier();

        if(rank == 0) {
            std::cout << "On process rank " << rank << "." << std::endl;
            std::cout << "Vector summing successful and took " << std::difftime(std::time(NULL), summing_start) << " seconds." << std::endl;

            std::cout << "The arithmetic sum of the elements in the vector is " << total << std::endl;
            segment.destroy<MyVector>("MyVector");
        }

        std::flush(std::cout);
        MPI::COMM_WORLD.Barrier();

        boost::interprocess::shared_memory_object::remove(shared_memory_name.c_str());
    }

    sleep(300);
    MPI::Finalize();

    return 0;
}

I noticed that this causes the entire shared object to be mapped into each process's virtual memory space, which is an issue on our computing cluster because it limits virtual memory to the size of physical memory. Is there a way to share this data structure without mapping the entire shared memory space, perhaps by sharing a pointer of some kind? Would trying to access unmapped shared memory even be defined behavior? Unfortunately, the operations we perform on the array mean that each process eventually needs to access every element in it (although not concurrently). I suppose it's possible to break the shared array into pieces and trade portions of the array for those you need, but this is not ideal.

Since the data you want to share is so large, it may be more practical to treat it as an actual file and use file operations to read the portion you need. Then you do not need shared memory at all; each process simply reads directly from the file system.

std::ifstream file("data.dat", std::ios::in | std::ios::binary);
file.seekg(someOffset, std::ios::beg);
file.read(reinterpret_cast<char*>(array), sizeof(array));
