
C optimization - low level code

I'm trying to write a memory allocator comparable to dlmalloc, the malloc implementation used in glibc. dlmalloc is a best-fit allocator with block splitting, and it retains a pool of recently freed blocks before consolidating them into larger blocks again. The allocator I'm writing uses first fit instead.

My problem is twofold: (1) the test times for my code are highly irregular compared to those of glibc malloc, and (2) on some days the average run time of my code is 3 to 4 times greater. (2) is not a big deal, but I would like to understand why glibc malloc does not suffer in the same way. Later in this post I show a sample of the behavior described in (1) for both malloc and my code. Sometimes a batch of 1000 tests will have an average time much higher than that of malloc (problem (2) above); sometimes the averages are the same. But the test times for a batch of tests on my code are always highly irregular (problem (1) above): there are jumps to 20 times the average, interspersed among otherwise regular (close to average) times. glibc malloc does not do this.

The code I'm working on follows.

===================================

#include <cstdlib>   /* for malloc, used in the tests below */

/* represents an allocated/unallocated block of memory */
struct Block {
    /* previous allocated or unallocated block; needed for consolidation but not used in allocation */
    Block* prev;
    /* 1 if allocated, 0 if not */
    unsigned int tagh;
    /* previous unallocated block */
    Block* prev_free;
    /* next unallocated block */
    Block* next_free;
    /* size of current block */
    unsigned int size;
};

#define CACHE_SZ 120000000

/* array to be managed by allocator */
char arr[CACHE_SZ] __attribute__((aligned(4)));

/* initialize the contiguous memory located at arr for the allocator */
void init_cache(){
    /* set up the list head node, which does not change */
    Block* a = (Block*) arr;
    a->prev = 0;
    a->tagh = 1;
    a->prev_free = 0;
    a->size = 0;

    /* set up the usable data block */
    Block* b = (Block*) (arr + sizeof(Block));
    b->prev = a;
    b->tagh = 0;
    b->prev_free = a;
    b->size = CACHE_SZ - 3*sizeof(Block);
    a->next_free = b;

    /* set up the list tail node, which does not change */
    Block* e = (Block*)((char*)arr + CACHE_SZ - sizeof(Block));
    e->prev = b;
    e->tagh = 1;
    e->prev_free = b;
    e->next_free = 0;
    e->size = 0;
    b->next_free = e;
}

char* alloc(unsigned int size){
    register Block* current = ((Block*) arr)->next_free;
    register Block* new_block;

    /* search for a first-fit block */
    while(current != 0){
        if( current->size >= size + sizeof(Block)) goto good;
        current = current->next_free;
    }

    /* no block of sufficient size was found */
    return 0;

/* good block found */
good:
    /* if the block size is exact, return it */
    if( current->size == size){
        if(current->next_free != 0) current->next_free->prev_free = current->prev_free;
        if(current->prev_free != 0) current->prev_free->next_free = current->next_free;
        return (char*) current + sizeof(Block);
    }

    /* otherwise split the block, carving the new block off its tail */
    current->size -= size + sizeof(Block);

    new_block = (Block*)( (char*)current + sizeof(Block) + current->size);
    new_block->size = size;
    new_block->prev = current;
    new_block->tagh = 1;
    ((Block*)((char*) new_block + sizeof(Block) + new_block->size))->prev = new_block;

    return (char*) new_block + sizeof(Block);
}

int main(int argc, char** argv){
    init_cache();
    int count = 0;

    /* the count of 4883 allocations nearly exhausts the cache arr */
    while(count < 4883){
        /* the following line tests malloc; the quantity (1024*24) ensures word alignment */
        //char * volatile p = (char *) malloc(1024*24);
        /* the following line tests the code above in exactly the same way */
        char * volatile p = alloc(1024*24);
        count++;
    }
    return 0;
}

=====================================

I compile the above code simply with :

g++ -O9 alloc.c

and run a simple test that always splits a block and never returns an exact-size block :

bash$ for((i=0; i<1000; i++)); do (time ./a.out) 2>&1|grep real; done

Sample outputs of the test for my code and for glibc malloc are as follows :

my code :

real    0m0.023s
real    0m0.109s    <----- irregular jump >
real    0m0.024s
real    0m0.086s
real    0m0.022s
real    0m0.104s    <----- again irregular jump >
real    0m0.023s
real    0m0.023s
real    0m0.098s
real    0m0.023s
real    0m0.097s
real    0m0.024s
real    0m0.091s
real    0m0.023s
real    0m0.025s
real    0m0.088s
real    0m0.023s
real    0m0.086s
real    0m0.024s
real    0m0.024s

malloc code (nice and regular, staying close to 25ms) :

real    0m0.025s
real    0m0.024s
real    0m0.024s
real    0m0.026s
real    0m0.024s
real    0m0.026s
real    0m0.025s
real    0m0.026s
real    0m0.026s
real    0m0.025s
real    0m0.025s
real    0m0.024s
real    0m0.024s
real    0m0.024s
real    0m0.025s
real    0m0.026s
real    0m0.025s

Note that the malloc times are more regular. At other, unpredictable times, my code takes around 0m0.070s instead of 0m0.020s, so the average run time is close to 70ms instead of 25ms (problem (2) above), but this is not shown here. In this case I was lucky enough to have it running close to the average of malloc (25ms).

My questions are: (1) how can I modify my code to get regular times like those of glibc malloc, and (2) how can I make it even faster than glibc malloc, if that is possible? I ask because I have read that dlmalloc is a characteristically balanced allocator and is not the fastest (considering only split/best-fit/first-fit allocators, not others).

Don't use 'real' (wall-clock) time: try 'user' + 'sys' instead, and average over a large number of iterations. The problem is twofold: (a) your process is not alone on the processor; it gets interrupted depending on what other processes do, and (b) time measurement with `time` has limited granularity. I'm not sure what it is today, but it used to be the size of a time slice, i.e. 1/100 s.

Right, I have compared both solutions and run them in a few different variants. I don't KNOW for sure what the problem is, but my speculation is that a large part of the time is spent on creating a large contiguous slab of 1200000000 bytes. If I reduce the size and still perform the same number of allocations, the time goes down.

Another piece of evidence pointing at this is that system time is a large portion of the real time, while user time is nearly nothing.

Now, on MY system, it doesn't really wobble that much up and down once I've run these things a few times under high memory load. That's quite likely because once I've swapped out a bunch of old gunk that had accumulated in memory, the system simply has plenty of "spare" pages to use for my process. When memory is more constrained (because I've let the system do some other things, such as database work on the "website" I experiment on [a "sandbox" version of a real website, so it has real data in the database and can quickly fill memory]), I get more variation until I clear out memory again.

But I think the key to the "mystery" is that system time makes up the vast majority of the time used. It's also notable that when malloc hands out large blocks, the memory is not actually being committed up front. And when allocating smaller blocks, malloc seems to be cleverer in some way, and is faster than the "optimised" alloc, at least for larger amounts of memory. Don't ask me exactly how that works.

Here's some evidence:

I changed the main in the code to do:

#include <cstdio>
#include <cstdlib>

#define BLOCK_SIZE (CACHE_SZ / 5000)

int main(int argc, char** argv){
    init_cache();
    int count = 0;
    int failed = 0;
    size_t size = 0;

    /* the count considers the size of the cache arr */
    while(count < int((CACHE_SZ / BLOCK_SIZE) * 0.96)){
        /* pass any argument to test malloc; no argument tests alloc */
        char * volatile p;
        if (argc > 1)
            p = (char *)malloc(BLOCK_SIZE);
        else
            p = alloc(BLOCK_SIZE);
        if (p == 0)
        {
            failed++;
            puts("p = NULL\n");
        }
        count++;
        size += BLOCK_SIZE;
    }
    printf("Count = %d, total=%zd, failed=%d\n", count, size, failed);
    return 0;
}

I then varied CACHE_SZ and ran with or without an argument to select alloc or malloc.

With cache-size 12000000 (12MB), the figures for alloc are:

real    0m0.008s
user    0m0.001s
sys 0m0.007s
Count = 4800, total=11520000, failed=0

real    0m0.007s
user    0m0.000s
sys 0m0.006s
Count = 4800, total=11520000, failed=0

real    0m0.008s
user    0m0.001s
sys 0m0.006s
Count = 4800, total=11520000, failed=0

real    0m0.014s
user    0m0.003s
sys 0m0.010s

And a few runs with malloc :

real    0m0.010s
user    0m0.000s
sys 0m0.009s
Count = 4800, total=11520000, failed=0

real    0m0.017s
user    0m0.001s
sys 0m0.015s
Count = 4800, total=11520000, failed=0

real    0m0.012s
user    0m0.001s
sys 0m0.010s
Count = 4800, total=11520000, failed=0

real    0m0.021s
user    0m0.007s
sys 0m0.013s
Count = 4800, total=11520000, failed=0

real    0m0.010s
user    0m0.001s
sys 0m0.008s
Count = 4800, total=11520000, failed=0

real    0m0.009s
user    0m0.001s
sys 0m0.007s

Making the cache-size 10x larger gives the following results for alloc :

real    0m0.038s
user    0m0.001s
sys 0m0.036s
Count = 4800, total=115200000, failed=0

real    0m0.040s
user    0m0.001s
sys 0m0.037s
Count = 4800, total=115200000, failed=0

real    0m0.045s
user    0m0.001s
sys 0m0.043s
Count = 4800, total=115200000, failed=0

real    0m0.044s
user    0m0.001s
sys 0m0.043s
Count = 4800, total=115200000, failed=0

real    0m0.046s
user    0m0.001s
sys 0m0.043s
Count = 4800, total=115200000, failed=0

real    0m0.042s
user    0m0.000s
sys 0m0.042s

And with malloc :

real    0m0.026s
user    0m0.004s
sys 0m0.021s
Count = 4800, total=115200000, failed=0

real    0m0.027s
user    0m0.002s
sys 0m0.023s
Count = 4800, total=115200000, failed=0

real    0m0.022s
user    0m0.002s
sys 0m0.018s
Count = 4800, total=115200000, failed=0

real    0m0.016s
user    0m0.001s
sys 0m0.015s
Count = 4800, total=115200000, failed=0

real    0m0.027s
user    0m0.002s
sys 0m0.024s
Count = 4800, total=115200000, failed=0

And another 10x with alloc :

real    0m1.408s
user    0m0.002s
sys 0m1.395s
Count = 4800, total=1152000000, failed=0

real    0m1.517s
user    0m0.001s
sys 0m1.505s
Count = 4800, total=1152000000, failed=0

real    0m1.478s
user    0m0.000s
sys 0m1.466s
Count = 4800, total=1152000000, failed=0

real    0m1.401s
user    0m0.001s
sys 0m1.389s
Count = 4800, total=1152000000, failed=0

real    0m1.445s
user    0m0.002s
sys 0m1.433s
Count = 4800, total=1152000000, failed=0

real    0m1.468s
user    0m0.000s
sys 0m1.458s
Count = 4800, total=1152000000, failed=0

With malloc :

real    0m0.020s
user    0m0.002s
sys 0m0.017s
Count = 4800, total=1152000000, failed=0

real    0m0.022s
user    0m0.001s
sys 0m0.020s
Count = 4800, total=1152000000, failed=0

real    0m0.027s
user    0m0.005s
sys 0m0.021s
Count = 4800, total=1152000000, failed=0

real    0m0.029s
user    0m0.002s
sys 0m0.026s
Count = 4800, total=1152000000, failed=0

real    0m0.020s
user    0m0.001s
sys 0m0.019s
Count = 4800, total=1152000000, failed=0

If we change the code to make BLOCK_SIZE a constant 1000, the difference between alloc and malloc gets much smaller. Here are the alloc results:

 Count = 1080000, total=1080000000, failed=0

real    0m1.183s
user    0m0.028s
sys 0m1.137s
Count = 1080000, total=1080000000, failed=0

real    0m1.179s
user    0m0.017s
sys 0m1.143s
Count = 1080000, total=1080000000, failed=0

real    0m1.196s
user    0m0.026s
sys 0m1.152s
Count = 1080000, total=1080000000, failed=0

real    0m1.197s
user    0m0.023s
sys 0m1.157s
Count = 1080000, total=1080000000, failed=0

real    0m1.188s
user    0m0.021s
sys 0m1.147s

And now malloc :

Count = 1080000, total=1080000000, failed=0

real    0m0.582s
user    0m0.063s
sys 0m0.482s
Count = 1080000, total=1080000000, failed=0

real    0m0.586s
user    0m0.062s
sys 0m0.489s
Count = 1080000, total=1080000000, failed=0

real    0m0.582s
user    0m0.059s
sys 0m0.483s
Count = 1080000, total=1080000000, failed=0

real    0m0.590s
user    0m0.064s
sys 0m0.477s
Count = 1080000, total=1080000000, failed=0

real    0m0.586s
user    0m0.075s
sys 0m0.473s
