简体   繁体   中英

Slow heap array performance

I'm experiencing strange memory access performance problem, any ideas?

int* pixel_ptr = somewhereFromHeap;

int local_ptr[307200]; //local

//this is very slow
for(int i=0;i<307200;i++){
  pixel_ptr[i] = someCalculatedVal ;
}

//this is very slow
for(int i=0;i<307200;i++){
  pixel_ptr[i] = 1 ; //constant
}

//this is fast
for(int i=0;i<307200;i++){
  int val = pixel_ptr[i];
  local_ptr[i] = val;
}

//this is fast
for(int i=0;i<307200;i++){
  local_ptr[i] = someCalculatedVal ;
}

Tried consolidating values to local scanline

int scanline[640]; // local

//this is very slow
for(int i=xMin;i<xMax;i++){
  int screen_pos = sy*screen_width+i;
  int val = scanline[i];
  pixel_ptr[screen_pos] = val ;
}

//this is fast
for(int i=xMin;i<xMax;i++){
  int screen_pos = sy*screen_width+i;
  int val = scanline[i];
  pixel_ptr[screen_pos] = 1 ; //constant
}

//this is fast
for(int i=xMin;i<xMax;i++){
  int screen_pos = sy*screen_width+i;
  int val = i; //or a constant
  pixel_ptr[screen_pos] = val ;
}

//this is slow
for(int i=xMin;i<xMax;i++){
  int screen_pos = sy*screen_width+i;
  int val = scanline[0];
  pixel_ptr[screen_pos] = val ;
}

Any ideas? I'm using mingw with cflags -01 -std=c++11 -fpermissive.

update4: I have to say that these are snippets from my program and there are heavy code/functions running before and after. The scanline block did ran at the end of function before exit.

Now with proper test program. thks to @Iwillnotexist.

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define SIZE 307200
#define SAMPLES 1000

double local_test(){
    int local_array[SIZE];

    timeval start, end;
    long cpu_time_used_sec,cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i=0;i<SIZE;i++){
        local_array[i] = i;
    }
    gettimeofday(&end, NULL);
    cpu_time_used_sec = end.tv_sec- start.tv_sec;
    cpu_time_used_usec = end.tv_usec- start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;

    return cpu_time_used;
}

double heap_test(){
    int* heap_array=new int[SIZE];

    timeval start, end;
    long cpu_time_used_sec,cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i=0;i<SIZE;i++){
        heap_array[i] = i;
    }
    gettimeofday(&end, NULL);
    cpu_time_used_sec = end.tv_sec- start.tv_sec;
    cpu_time_used_usec = end.tv_usec- start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;

    delete[] heap_array;

    return cpu_time_used;
}


double heap_test2(){
    static int* heap_array = NULL;

    if(heap_array==NULL){
        heap_array = new int[SIZE];
    }

    timeval start, end;
    long cpu_time_used_sec,cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i=0;i<SIZE;i++){
        heap_array[i] = i;
    }
    gettimeofday(&end, NULL);
    cpu_time_used_sec = end.tv_sec- start.tv_sec;
    cpu_time_used_usec = end.tv_usec- start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;

    return cpu_time_used;
}


int main (int argc, char** argv){
    double cpu_time_used = 0;

    for(int i=0;i<SAMPLES;i++)
        cpu_time_used+=local_test();

    printf("local: %f ms\n",cpu_time_used);

    cpu_time_used = 0;

    for(int i=0;i<SAMPLES;i++)
        cpu_time_used+=heap_test();

    printf("heap_: %f ms\n",cpu_time_used);

    cpu_time_used = 0;

    for(int i=0;i<SAMPLES;i++)
        cpu_time_used+=heap_test2();

    printf("heap2: %f ms\n",cpu_time_used);

}

Complied with no optimization.

local: 577.201000 ms

heap_: 826.802000 ms

heap2: 686.401000 ms

The first heap test with new and delete is 2x slower. (paging as suggested?)

The second heap with reused heap array is still 1.2x slower. But I guess the second test is not that practical as there tend to other codes running before and after at least for my case. For my case, my pixel_ptr of course only allocated once during prograim initialization.

But if anyone has solutions/idea to speeding things up please reply!

I'm still perplexed why heap write is so much slower than stack segment. Surely there must be some tricks to make the heap more cpu/cache flavourable.

Final update?:

I revisited, the disassemblies again and this time, suddenly I have an idea why some of my breakpoints don't activate. The program looks suspiciously shorter thus I suspect the complier might have removed the redundant dummy code I put in which explains why the local array is magically many times faster.

I was a bit curious so I did the test, and indeed I could measure a difference between stack and heap access.

The first guess would be that the generated assembly is different, but after taking a look, it is actually identical for heap and stack (which makes sense, memory shouldn't be discriminated).

If the assembly is the same, then the difference must come from the paging mechanism . The guess is that on the stack, the pages are already allocated, but on the heap, first access cause a page fault and page allocation (invisible, it all happens at kernel level). To verify this, I did the same test, but first I would access the heap once before measuring. The test gave identical times for stack and heap. To be sure, I also did a test in which I first accessed the heap, but only every 4096 bytes (every 1024 int), then 8192, because a page is usually 4096 bytes long. The result is that accessing only every 4096 bytes also gives the same time for heap and stack, but accessing every 8192 gives a difference, but not as much as with no previous access at all. This is because only half of the pages were accessed and allocated beforehand.

So the answer is that on the stack, memory pages are already allocated, but on the heap, pages are allocated on-the-fly. This depends on the OS paging policy, but all major PC OSes probably have a similar one.

For all the tests I used Windows, with MS compiler targeting x64.

EDIT: For the test, I measured a single, larger loop, so there was only one access at each memory location. delete ing the array and measuring the same loop multiple time should give similar times for stack and heap, because delete ing memory probably don't de-allocate the pages, and they are already allocated for the next loop (if the next new allocated on the same space).

The following two code examples should not differ in runtime with a good compiler setting. Probably your compiler will generate the same code:

//this is fast
for(int i=0;i<307200;i++){
  int val = pixel_ptr[i];
  local_ptr[i] = val;
}

//this is fast
for(int i=0;i<307200;i++){
  local_ptr[i] = pixel_ptr[i];
}

Please try to increase optimization setting.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM