
C optimization - low level code

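I am trying to write a memory allocator comparable to dlmalloc, which is the malloc used in glibc. dlmalloc is a best-fit allocator with block splitting, and it keeps a pool of recently used blocks before consolidating them into large blocks again. The allocator I am writing does first-fit instead.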
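To make the first-fit versus best-fit distinction concrete, here is a minimal sketch over a plain array of free-block sizes. The function names `first_fit` and `best_fit` and the array representation are assumptions made for illustration only; neither dlmalloc nor the allocator in this post stores free blocks this way.

```c
#include <stddef.h>

/* Hypothetical helper: index of the FIRST free block that can hold
   `need` bytes, or -1 if none can. Stops at the first match. */
long first_fit(const size_t *free_sizes, size_t n, size_t need) {
    for (size_t i = 0; i < n; i++)
        if (free_sizes[i] >= need)
            return (long)i;
    return -1;
}

/* Hypothetical helper: index of the SMALLEST free block that can hold
   `need` bytes, or -1 if none can. Always scans the whole list. */
long best_fit(const size_t *free_sizes, size_t n, size_t need) {
    long best = -1;
    for (size_t i = 0; i < n; i++)
        if (free_sizes[i] >= need &&
            (best < 0 || free_sizes[i] < free_sizes[(size_t)best]))
            best = (long)i;
    return best;
}
```

With free sizes `{64, 512, 128, 256}` and a 100-byte request, first-fit picks the 512-byte block (index 1) while best-fit picks the 128-byte block (index 2). First-fit can stop early; a linear best-fit has to look at every candidate.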

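My problem is two-fold: (1) the test times for my code are very irregular compared with those of glibc malloc, and (2) on some days the average run time of my code is 3 to 4 times greater. Problem (2) is not a big deal, but I would like to understand why glibc malloc is not affected in the same way. Further down, this post shows a sample of the behavior described in (1) for malloc and for my code. Sometimes the average time of a batch of 1000 tests is much higher than malloc's (problem (2) above), and sometimes the averages are the same. However, the test times within a batch of tests on my code are always very irregular (problem (1) above): there are jumps of up to 20 times the average within a batch, and these jumps are interspersed among otherwise regular (close to average) times. glibc malloc does not do this.

The code I am working on follows.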

===================================

/* represent an allocated/unallocated  block of memory */
struct Block {

    /* previous allocated or unallocated block needed for consolidation but not used in allocation */
    Block* prev;
    /* 1 if allocated and 0 if not */
    unsigned int tagh;
   /* previous unallocated block */
   Block* prev_free;
   /* next unallocated block  */
   Block* next_free;
   /* size of current block */
   unsigned int size;
};

#define CACHE_SZ 120000000

/* array to be managed by allocator */
char arr[CACHE_SZ] __attribute__((aligned(4)));

/* initialize the contiguous memory located at arr for allocator */
void init_cache(){
/* setup list head node that does not change */
   Block* a = (Block*)  arr;
  a->prev = 0; 
  a->tagh = 1;
  a->prev_free = 0;
  a->size = 0;

/* setup the usable data block */
  Block* b = (Block*) (arr + sizeof(Block));
  b->prev = a; 
  b->tagh = 0;
  b->prev_free = a;
  b->size = CACHE_SZ - 3*sizeof(Block);
  a->next_free = b;

/* setup list tail node that does not change */
  Block* e = (Block*)((char*)arr + CACHE_SZ - sizeof(Block)); 
  e->prev = b;
  e->tagh = 1;
  e->prev_free = b;
  e->next_free = 0;
  e->size = 0;
  b->next_free = e;
}

char* alloc(unsigned int size){
  register Block* current = ((Block*) arr)->next_free; 
  register Block* new_block;

/* search for a first-fit block */

   while(current != 0){
       if( current->size >= size + sizeof(Block)) goto good;
       current = current->next_free;
   }

/* what to do if no decent size block found */
   if( current == 0) {
       return 0;
   }

/* good block found */
good:
/* if block size is exact return it */
   if( current->size == size){
       if(current->next_free != 0) current->next_free->prev_free = current->prev_free;
       if(current->prev_free != 0) current->prev_free->next_free = current->next_free;
       return (char* ) current + sizeof(Block);
   }

/* otherwise split the block */

   current->size -= size + sizeof(Block); 

    new_block = (Block*)( (char*)current + sizeof(Block) + current->size);
    new_block->size = size;
    new_block->prev = current;
    new_block->tagh = 1;
   ((Block*)((char*) new_block + sizeof(Block) + new_block->size ))->prev = new_block;

   return (char* ) new_block + sizeof(Block);
}

int main(int argc, char** argv){
    init_cache();
    int count = 0;

/* the count considers the size of the cache arr */
    while(count < 4883){

/* the following line tests malloc; the quantity(1024*24) ensures word alignment */
   //char * volatile p = (char *) malloc(1024*24);
/* the following line tests above code in exactly the same way */
    char * volatile p = alloc(1024*24);
        count++;

    }
}

=====================================

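I compile the code above with:

g++ -O9 alloc.c

and run a simple test that always splits blocks and never returns an exact-size block:

bash$ for ((i = 0; i < 1000; i++)); do (time ./a.out) 2>&1 | grep real; done

Sample test output for my code and for glibc malloc follows:

My code: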

real    0m0.023s
real    0m0.109s    <----- irregular jump >
real    0m0.024s
real    0m0.086s
real    0m0.022s
real    0m0.104s    <----- again irregular jump >
real    0m0.023s
real    0m0.023s
real    0m0.098s
real    0m0.023s
real    0m0.097s
real    0m0.024s
real    0m0.091s
real    0m0.023s
real    0m0.025s
real    0m0.088s
real    0m0.023s
real    0m0.086s
real    0m0.024s
real    0m0.024s

malloc code (nice and regular, staying in the 20 ms range):

real    0m0.025s
real    0m0.024s
real    0m0.024s
real    0m0.026s
real    0m0.024s
real    0m0.026s
real    0m0.025s
real    0m0.026s
real    0m0.026s
real    0m0.025s
real    0m0.025s
real    0m0.024s
real    0m0.024s
real    0m0.024s
real    0m0.025s
real    0m0.026s
real    0m0.025s

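Note that the malloc times are much more regular. At other, unpredictable times my code shows 0m0.070s instead of 0m0.020s, so the average run time is close to 70 ms instead of 25 ms (problem (2) above), but that is not shown here. In this case I got lucky and it ran close to malloc's average (25 ms).

The questions are: (1) how can I modify my code to get more regular times, like glibc malloc? and (2) if it is possible, how can I make it even faster than glibc malloc, given that I have read that dlmalloc is a characteristically balanced allocator and not the fastest (considering only splitting/best-fit/first-fit allocators, not others)?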

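Don't use 'real' time: try 'user' + 'sys', and average over a large number of iterations. The problem is two-fold: (a) your process is not the only one on the processor and gets interrupted depending on what other processes are doing, and (b) time measurement has a granularity. I'm not sure what it is today, but in earlier times it was simply the size of a time slice, i.e. 1/100 s.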
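As a rough sketch of that advice, the helpers below read user plus system CPU time through POSIX `getrusage` and average it over many runs. The names `cpu_seconds`, `mean_cpu_seconds`, and `sample_work` are made up for this illustration and are not part of the code in the question.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* User + system CPU seconds consumed by this process so far
   (returns -1.0 if getrusage fails). */
double cpu_seconds(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6
         + (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1e6;
}

/* Run `work` `iterations` times and return the mean CPU seconds per run. */
double mean_cpu_seconds(void (*work)(void), int iterations) {
    double start = cpu_seconds();
    for (int i = 0; i < iterations; i++)
        work();
    return (cpu_seconds() - start) / iterations;
}

/* Placeholder workload, only here so the helpers can be exercised. */
void sample_work(void) {
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        sink += i;
}
```

Calling `mean_cpu_seconds(sample_work, 1000)` gives an average that is not inflated by time spent waiting for other processes. One caveat: repeating the work inside a single process only pays first-touch memory costs once, so to compare against the per-process numbers in the question you would still launch the program repeatedly and add up the `user` and `sys` lines reported by `time`.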

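Yes, I compared the two solutions and ran them in several different ways. I'm not sure what the problem is, but my guess is that most of the time is spent "creating a large contiguous slab of 1200000000 bytes". If I reduce the size and still perform the same number of allocations, the time goes down.

Another piece of evidence pointing to this is that system time is a large part of real, while user time is almost nothing.

Now, on my system, once I had run these things a few times under high memory load, the times don't really swing that much. That is quite possibly because once I had swapped out a bunch of old junk that had accumulated in memory, the system had plenty of "spare" pages to use for my process. When memory is more constrained (because I let the system go and do some other things, such as database work on a "website" I'm experimenting with [it's a "sandbox" version of a real website, so it has real data in the database and can fill memory quickly, etc.]), I get more variation until I clear out memory again.

But I think the key to the "mystery" is that system time is the vast majority of the time used. It's also worth noting that when malloc is used with large blocks, the memory isn't actually "really allocated". And when allocating smaller blocks, malloc seems to be somewhat smarter, and faster than the "optimized" alloc, at least for larger amounts of memory. Don't ask me exactly how that works out.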
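The point about memory not being "really allocated" can be observed directly. The sketch below assumes a Linux-like system with demand paging and uses an anonymous `mmap` as a stand-in for a large allocation. It counts page faults before and after touching the memory. `page_faults` and `measure_lazy_alloc` are hypothetical names, and the exact counts depend on the kernel (for example on whether transparent huge pages are enabled).

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Page faults (minor + major) taken by this process so far. */
long page_faults(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/* Map `bytes` of anonymous memory, then write one byte per page.
   *map_faults receives the faults taken while mapping,
   *touch_faults the faults taken while touching the pages.
   Returns 0 on success, -1 if the mapping failed. */
int measure_lazy_alloc(size_t bytes, long *map_faults, long *touch_faults) {
    long page = sysconf(_SC_PAGESIZE);
    long before = page_faults();
    char *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    long mapped = page_faults();
    for (size_t off = 0; off < bytes; off += (size_t)page)
        ((volatile char *)p)[off] = 1;   /* first write forces the kernel to back the page */
    long touched = page_faults();
    munmap(p, bytes);
    *map_faults = mapped - before;
    *touch_faults = touched - mapped;
    return 0;
}
```

On a typical Linux machine the mapping step itself is expected to cost few or no faults, while the touching loop takes the bulk of them, and each of those faults is kernel work that shows up as `sys` time. A loop that only calls malloc for large blocks and never writes to them would skip most of that cost.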

Here is some evidence:

I changed main in the code to do this:

#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE (CACHE_SZ / 5000)

int main(int argc, char** argv){
    init_cache();
    int count = 0;
    int failed = 0;
    size_t size = 0;

/* the count considers the size of the cache arr (fill 96% of it) */
    while(count < (int)((CACHE_SZ / BLOCK_SIZE) * 0.96) ){

/* with any command-line argument, test malloc;
   otherwise test the alloc code above in exactly the same way */
    char * volatile p;
    if (argc > 1)
        p = (char *)malloc(BLOCK_SIZE);
    else
        p = alloc(BLOCK_SIZE);
    if (p == 0)
    {
        failed++;
        puts("p = NULL");
    }
    count++;
    size += BLOCK_SIZE;
    }
    /* size is a size_t, so print it with %zu */
    printf("Count = %d, total=%zu, failed=%d\n", count, size, failed);
}

Then change CACHE_SZ and run with or without an argument to select the alloc or malloc variant:

So, with a cache size of 12000000 (12 MB):

The numbers for alloc are:

real    0m0.008s
user    0m0.001s
sys 0m0.007s
Count = 4800, total=11520000, failed=0

real    0m0.007s
user    0m0.000s
sys 0m0.006s
Count = 4800, total=11520000, failed=0

real    0m0.008s
user    0m0.001s
sys 0m0.006s
Count = 4800, total=11520000, failed=0

real    0m0.014s
user    0m0.003s
sys 0m0.010s

malloc run:

real    0m0.010s
user    0m0.000s
sys 0m0.009s
Count = 4800, total=11520000, failed=0

real    0m0.017s
user    0m0.001s
sys 0m0.015s
Count = 4800, total=11520000, failed=0

real    0m0.012s
user    0m0.001s
sys 0m0.010s
Count = 4800, total=11520000, failed=0

real    0m0.021s
user    0m0.007s
sys 0m0.013s
Count = 4800, total=11520000, failed=0

real    0m0.010s
user    0m0.001s
sys 0m0.008s
Count = 4800, total=11520000, failed=0

real    0m0.009s
user    0m0.001s
sys 0m0.007s

Making the cache size 10 times larger gives the following results for alloc:

real    0m0.038s
user    0m0.001s
sys 0m0.036s
Count = 4800, total=115200000, failed=0

real    0m0.040s
user    0m0.001s
sys 0m0.037s
Count = 4800, total=115200000, failed=0

real    0m0.045s
user    0m0.001s
sys 0m0.043s
Count = 4800, total=115200000, failed=0

real    0m0.044s
user    0m0.001s
sys 0m0.043s
Count = 4800, total=115200000, failed=0

real    0m0.046s
user    0m0.001s
sys 0m0.043s
Count = 4800, total=115200000, failed=0

real    0m0.042s
user    0m0.000s
sys 0m0.042s

and with malloc:

real    0m0.026s
user    0m0.004s
sys 0m0.021s
Count = 4800, total=115200000, failed=0

real    0m0.027s
user    0m0.002s
sys 0m0.023s
Count = 4800, total=115200000, failed=0

real    0m0.022s
user    0m0.002s
sys 0m0.018s
Count = 4800, total=115200000, failed=0

real    0m0.016s
user    0m0.001s
sys 0m0.015s
Count = 4800, total=115200000, failed=0

real    0m0.027s
user    0m0.002s
sys 0m0.024s
Count = 4800, total=115200000, failed=0

And another 10 times larger, with alloc:

real    0m1.408s
user    0m0.002s
sys 0m1.395s
Count = 4800, total=1152000000, failed=0

real    0m1.517s
user    0m0.001s
sys 0m1.505s
Count = 4800, total=1152000000, failed=0

real    0m1.478s
user    0m0.000s
sys 0m1.466s
Count = 4800, total=1152000000, failed=0

real    0m1.401s
user    0m0.001s
sys 0m1.389s
Count = 4800, total=1152000000, failed=0

real    0m1.445s
user    0m0.002s
sys 0m1.433s
Count = 4800, total=1152000000, failed=0

real    0m1.468s
user    0m0.000s
sys 0m1.458s
Count = 4800, total=1152000000, failed=0

With malloc:

real    0m0.020s
user    0m0.002s
sys 0m0.017s
Count = 4800, total=1152000000, failed=0

real    0m0.022s
user    0m0.001s
sys 0m0.020s
Count = 4800, total=1152000000, failed=0

real    0m0.027s
user    0m0.005s
sys 0m0.021s
Count = 4800, total=1152000000, failed=0

real    0m0.029s
user    0m0.002s
sys 0m0.026s
Count = 4800, total=1152000000, failed=0

real    0m0.020s
user    0m0.001s
sys 0m0.019s
Count = 4800, total=1152000000, failed=0

If we change the code so that BLOCK_SIZE is a constant 1000, the difference between alloc and malloc is much smaller. Here are the alloc results:

 Count = 1080000, total=1080000000, failed=0

real    0m1.183s
user    0m0.028s
sys 0m1.137s
Count = 1080000, total=1080000000, failed=0

real    0m1.179s
user    0m0.017s
sys 0m1.143s
Count = 1080000, total=1080000000, failed=0

real    0m1.196s
user    0m0.026s
sys 0m1.152s
Count = 1080000, total=1080000000, failed=0

real    0m1.197s
user    0m0.023s
sys 0m1.157s
Count = 1080000, total=1080000000, failed=0

real    0m1.188s
user    0m0.021s
sys 0m1.147s

Now malloc:

Count = 1080000, total=1080000000, failed=0

real    0m0.582s
user    0m0.063s
sys 0m0.482s
Count = 1080000, total=1080000000, failed=0

real    0m0.586s
user    0m0.062s
sys 0m0.489s
Count = 1080000, total=1080000000, failed=0

real    0m0.582s
user    0m0.059s
sys 0m0.483s
Count = 1080000, total=1080000000, failed=0

real    0m0.590s
user    0m0.064s
sys 0m0.477s
Count = 1080000, total=1080000000, failed=0

real    0m0.586s
user    0m0.075s
sys 0m0.473s
