Possible Memory Leak in C++ MPI Program?

I'm writing some C++ MPI code for a Parallel Computing class. My code works, and I've turned the assignment in, but the code is using a lot more memory than I anticipated. As I increase the number of processors, the memory requirements per node grow rapidly. This is the first real C/C++ or MPI program I've ever had to write, so I think I have a memory leak somewhere. Can someone take a look at this code and tell me where? Whenever I create a variable using new, I delete it, so I'm not sure what else I should be looking for. I suppose some of the problem could come from the objects that I'm creating, but should the destructors for these objects be called at the end of their scope to free any memory that they have allocated on the heap? I come from a heavy Java background and most of my C/C++ is self-taught, so doing my own memory management is difficult to wrap my head around.

The problem is very simple. I have a matrix (stored as a one-dimensional vector) of size MSIZE * MSIZE. Each processor is responsible for some contiguous block of data. Then I run 500 iterations where each non-edge element A[r][c] is set to the maximum of A[r][c], A[r+1][c], A[r-1][c], A[r][c+1], A[r-1][c-1]. The new value of A[r][c] is not stored until the entire update process for that iteration has finished. Processors have to communicate values that are on the boundaries to other processors.
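For reference, the update rule described above can be sketched serially, without any MPI. This is only an illustration under stated assumptions: `stepOnce` is a hypothetical name that does not appear in the program, the matrix is assumed to be row-major, and the neighbor list is taken exactly as written in the paragraph above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One timestep on an n x n matrix stored row-major in a flat vector.
// stepOnce is a hypothetical helper for illustration only.
// Results are written to a copy, so every read sees the old values;
// edge elements are left unchanged.
std::vector<double> stepOnce(const std::vector<double>& a, std::size_t n) {
    std::vector<double> next(a);
    for (std::size_t r = 1; r + 1 < n; ++r) {
        for (std::size_t c = 1; c + 1 < n; ++c) {
            double m = a[r * n + c];
            // Neighbors as listed in the problem statement.
            m = std::max(m, a[(r + 1) * n + c]);
            m = std::max(m, a[(r - 1) * n + c]);
            m = std::max(m, a[r * n + (c + 1)]);
            m = std::max(m, a[(r - 1) * n + (c - 1)]);
            next[r * n + c] = m;
        }
    }
    return next;
}
```

Calling `stepOnce` 500 times on the full matrix would give a single-process baseline to compare the parallel checksum against.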

Here's my code (I think the problem is occurring somewhere in here, but if you want to see the rest of the code (mostly helper & initialization functions) let me know and I'll post it):

#include <math.h> 
#include "mpi.h" 
#include <iostream>
#include <float.h>
#include <math.h>
#include <assert.h>
#include <algorithm>
#include <map>
#include <vector>
#include <set>
using namespace std;

#define MSIZE 4000
#define NUM_ITERATIONS 500

int myRank;
int numProcs;
int start, end;
int numIncomingMessages;

double startTime;

vector<double> a;

map<int, set<int> > neighborsToNotify;

/**
 * Send the indices that have other processors depending on them to those processors.
 * Once the messages have been sent, receive messages until we've received all the messages
 * we are expecting to receive.
 */
void doCommunication(){
    int messagesReceived = 0;
    map<int, set<int> >::iterator iter;
    for(iter = neighborsToNotify.begin(); iter != neighborsToNotify.end(); iter++){
        int destination = iter->first;
        set<int> indices = iter->second;

        set<int>::iterator setIter;
        for(setIter = indices.begin(); setIter != indices.end(); setIter++){
            double val = a.at(*setIter);
            MPI_Bsend(&val, 1, MPI_DOUBLE, destination, *setIter, MPI_COMM_WORLD);
        }

        // Drain any messages that have already arrived.
        MPI_Status s;
        int flag;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &s);
        while(flag){
            double message;
            MPI_Recv(&message, 1, MPI_DOUBLE, s.MPI_SOURCE, s.MPI_TAG, MPI_COMM_WORLD, &s);
            a.at(s.MPI_TAG) = message;
            messagesReceived++;
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &s);
        }
    }

    // Block until every expected message has been received.
    while(messagesReceived < numIncomingMessages){
        MPI_Status s;
        double message;
        // s is not filled in yet here, so accept a message from any source with any tag.
        MPI_Recv(&message, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &s);
        a.at(s.MPI_TAG) = message;
        messagesReceived++;
    }
}

/**
 * Perform one timestep of iteration.
 */
void doIteration(){
    int pos;
    vector<double> temp;
    temp.assign(end - start + 1, 0);
    for(pos = start; pos <= end; pos++){
        int i;
        double max;

        int dependents[4];
        getDependentsOfPosition(pos, dependents);

        max = a.at(pos);

        for(i = 0; i < 4; i++){
            max = std::max(max, a.at(dependents[i]));
        }

        temp.at(pos - start) = max;
    }

    for(pos = start; pos <= end; pos++){
        if(! isEdgeNode(pos)){
            a.at(pos) = temp.at(pos - start);
        }
    }
}

/**
 * Compute the checksum for this processor
 */
double computeCheck(){
    int pos;
    double sum = 0;
    for(pos = start; pos <= end; pos++){
        sum += a.at(pos) * a.at(pos);
    }
    return sum;
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    if(myRank == 0){
        startTime = MPI_Wtime();
    }

    int i;
    for(i = 0; i < NUM_ITERATIONS; i++){
        if(myRank == 0)
            cout << ".";
    }

    double check = computeCheck();
    double receive = 0;

    MPI_Reduce(&check, &receive, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if(myRank == 0){
        cout << "n = " << MSIZE << " and p = " << numProcs << "\n";
        cout << "The total time was: " << MPI_Wtime() - startTime << " seconds \n";
        cout << "The checksum was: " << receive << " \n";
    }

    return 0;
}

I don't think you have a memory leak, but you can check with valgrind. Be aware that the output looks terrifying.

 mpirun -n 8 valgrind ./yourProgram

I think the reason is MPI. You use buffered sends, so each node allocates its own buffer; the more nodes you have, the more buffers are allocated. To check how your algorithm's memory use scales, switch to an unbuffered send (for debugging only, as it will kill your speedup). Alternatively, try increasing the matrix size. At the moment the matrix takes only about 128 MB (4000 × 4000 doubles at 8 bytes each), which is not much of a problem to parallelize. Try to find a size at which nearly all of the memory of one node is used.
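A quick way to estimate the matrix footprint when picking a larger size is sketched below. The helper names `matrixBytes` and `perNodeBytes` are hypothetical and not part of the posted program. The second helper rests on an assumption the posted code does not confirm: that every rank allocates the whole vector rather than only its own block.

```cpp
#include <cstddef>

// Bytes needed for an n x n matrix of doubles stored in one flat vector.
// Hypothetical helper for illustration only.
constexpr std::size_t matrixBytes(std::size_t n) {
    return n * n * sizeof(double);
}

// Assumption for illustration only: each rank on a node holds the whole
// matrix, so the node's total grows with the number of ranks placed on it.
constexpr std::size_t perNodeBytes(std::size_t n, std::size_t ranksPerNode) {
    return matrixBytes(n) * ranksPerNode;
}
```

With 8-byte doubles, `matrixBytes(4000)` is 128,000,000 bytes. Under the stated assumption, eight ranks on one node would need eight times that before any MPI buffers are counted.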

