Why are struct function pointers ("methods") in C so much slower than ordinary functions?

Question

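I have been using C in more and more projects lately, and I have more or less ended up building my own "object implementation" out of structs with function pointers. However, I am curious about the speed difference between a purely functional style (plain structs passed to functions) and structs that call function pointers in a more modern, object-oriented style.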

I created a sample program, and I am not sure why the timing difference is so large.

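The program uses two timers and records the time taken to complete each task (one after the other). The timing excludes memory allocation and deallocation, and both techniques are set up in a similar way (each struct holds three integers, stored as pointers).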

The code itself just adds three numbers together repeatedly in a for loop, for the number of iterations given by the macro LOOP_LEN.

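Note that the functions being measured are declared inline, and I have tried compiler optimization levels from none up to full optimization (/Ox). (I am compiling this as a plain .c file in Visual Studio.)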

Object-style code

#include <stdlib.h>

// MAGIC object (named tag so the method pointers below can refer to the type)
typedef struct magic {

    // Properties
    int* x;
    int* y;
    int* z;

    // Methods
    void(*init)(struct magic* self, int x, int y, int z);
    int(*sum)(struct magic* self);

}magic;

// Variable init function
void init(magic* self, int x, int y, int z) {

    // Assign variables to properties
    *self->x = x;
    *self->y = y;
    *self->z = z;

    return;

}

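// Add all variables together
// (static inline: a plain C99 `inline` definition emits no external symbol,
// so GCC/Clang may fail to link when the address is taken or the call is not inlined)
static inline int sum(magic* self) {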
    return ((*self->x) + (*self->y) + (*self->z));
}

// Magic object constructor
magic* new_m(int x, int y, int z) {

    // Allocate self
    magic* self = malloc(sizeof(magic));

    // Allocate member pointers
    self->x = malloc(sizeof(int));
    self->y = malloc(sizeof(int));
    self->z = malloc(sizeof(int));

    // Assign method pointers
    self->init = init;
    self->sum = sum;

    // Store the initial values through the init method
    self->init(self, x, y, z);

    // Return instance
    return self;
}

// Destructor
void delete_m(magic* self) {

    // Deallocate memory from constructor
    free(self->x); self->x = NULL;
    free(self->y); self->y = NULL;
    free(self->z); self->z = NULL;
    free(self); self = NULL;

    return;

}

Functional (traditional) style code

// Non-object-oriented approach
typedef struct {
    int* x;
    int* y;
    int* z;
}str_magic;

// Magic struct constructor
str_magic* new_m_str(int x, int y, int z) {

    // Allocate self
    str_magic* self = malloc(sizeof(str_magic));

    // Allocate member pointers
    self->x = malloc(sizeof(int));
    self->y = malloc(sizeof(int));
    self->z = malloc(sizeof(int));

    // Store the initial values
    *self->x = x;
    *self->y = y;
    *self->z = z;

    // Return instance
    return self;
}

// Destructor
void delete_m_str(str_magic* self) {

    // Deallocate memory from constructor
    free(self->x); self->x = NULL;
    free(self->y); self->y = NULL;
    free(self->z); self->z = NULL;
    free(self); self = NULL;

    return;

}

// Sum using normal structure type
// (static inline for the same C99 linkage reason as sum above)
static inline int sum_str(str_magic* self) {
    return ((*self->x) + (*self->y) + (*self->z));
}

Timer test and main program entry point

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LOOP_LEN 1000000000

// Main entry point
int main(void) {

    // Timer and result variables for both tasks
    clock_t start1, end1, start2, end2;
    double cpu_time_used1, cpu_time_used2;

    // Init instances before timer
    magic* object1 = new_m(1, 2, 3);

    // Start task1 clock
    start1 = clock();

    for (int i = 0; i < LOOP_LEN; i++) {
        // Perform method sum and store result
        int result1 = object1->sum(object1);
    }

    // Stop task1 clock
    end1 = clock();

    // Remove from memory
    delete_m(object1);

    // Calculate task1 execution time
    cpu_time_used1 = ((double)(end1 - start1)) / CLOCKS_PER_SEC;

    // Init instances before timer
    str_magic* object2 = new_m_str(1, 2, 3);

    // Start task2 clock
    start2 = clock();

    for (int i = 0; i < LOOP_LEN; i++) {
        // Perform function and store result
        int result2 = sum_str(object2);
    }

    // Stop task2 clock
    end2 = clock();

    // Remove from memory
    delete_m_str(object2);

    // Calculate task 2 execution time
    cpu_time_used2 = ((double)(end2 - start2)) / CLOCKS_PER_SEC;

    // Print time results
    // (%e, not %.*e: "%.*e" expects an int precision argument before each double)
    printf("----------------------\n    Task 1 : %e\n----------------------\n    Task 2 : %e\n----------------------\n", cpu_time_used1, cpu_time_used2);

    if (cpu_time_used1 < cpu_time_used2) {
        printf("Object Oriented Approach was faster by %e\n", cpu_time_used2 - cpu_time_used1);
    }
    else {
        printf("Functional Oriented Approach was faster by %e\n", cpu_time_used1 - cpu_time_used2);
    }

    // Wait for keyboard interrupt
    getchar();

    return 0;
}

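Every time the program runs, the functional version is faster. The only reason I can think of is that calling the method requires going through an extra layer of pointer indirection in the struct, but I thought inlining would remove that overhead.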

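The gap shrinks as the optimization level increases, but I would like to know why it is so large at low or no optimization, and, given that, whether this is considered an efficient programming style.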

Answer 1

With /O2, the second loop compiles to:

    call     clock
    mov      edi, eax ; this is used later to calculate time
    call     clock

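That is, there is no loop code at all. The compiler can see that the result of sum_str is unused, so it removes the call entirely. It cannot do the same for the first case.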

So, with optimizations enabled, the two loops are not really being compared.
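One way to make an optimized comparison more meaningful is to use every result, for example by adding it to a running total that is printed at the end. Below is a minimal sketch; the type `vec3` and the functions `vec3_sum`, `run_indirect`, `run_direct` and `benchmark` are hypothetical, and the values are stored directly rather than behind pointers. Even with this change, an optimizer may still hoist the loop-invariant sum or turn the direct loop into a multiplication, so check the generated assembly before trusting the timings.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical simplified type: values stored inline, plus one method pointer. */
typedef struct vec3 vec3;
struct vec3 {
    int x, y, z;
    int (*sum)(const vec3 *self);
};

static int vec3_sum(const vec3 *self) {
    return self->x + self->y + self->z;
}

/* Each loop adds every result into a running total, so the result is used. */
long long run_indirect(const vec3 *v, long n) {
    long long total = 0;
    for (long i = 0; i < n; i++)
        total += v->sum(v);    /* call through the struct member */
    return total;
}

long long run_direct(const vec3 *v, long n) {
    long long total = 0;
    for (long i = 0; i < n; i++)
        total += vec3_sum(v);  /* ordinary direct call */
    return total;
}

/* Times both loops and prints the totals, which keeps the work observable. */
void benchmark(long n) {
    vec3 v = { 1, 2, 3, vec3_sum };

    clock_t t0 = clock();
    long long a = run_indirect(&v, n);
    clock_t t1 = clock();
    long long b = run_direct(&v, n);
    clock_t t2 = clock();

    printf("indirect: %lld in %.3f s\n", a, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("direct:   %lld in %.3f s\n", b, (double)(t2 - t1) / CLOCKS_PER_SEC);
}
```

Call `benchmark(100000000L)` from `main` and build it at each optimization level you want to compare.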

Without optimization, the first version simply executes more code.

The first loop compiles to:

    cmp      DWORD PTR i$1[rsp], 1000000000
    jge      SHORT $LN3@main                 ; loop exit
    mov      rcx, QWORD PTR object1$[rsp]
    mov      rax, QWORD PTR object1$[rsp]    ; extra instruction
    call     QWORD PTR [rax+32]              ; indirect call
    mov      DWORD PTR result1$3[rsp], eax
    jmp      SHORT $LN2@main                 ; jump to the next iteration

The second loop:

    cmp      DWORD PTR i$2[rsp], 1000000000
    jge      SHORT $LN6@main                 ; loop exit
    mov      rcx, QWORD PTR object2$[rsp]
    call     sum_str
    mov      DWORD PTR result2$4[rsp], eax
    jmp      SHORT $LN5@main                 ; jump to the next iteration

sum and sum_str themselves compile to equivalent instruction sequences.

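The difference is one extra instruction in the loop, and the indirect call is slower than a direct one. Overall, without optimization there should not be much difference between the two versions; both should be slow.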

Answer 2

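I think Ivan and you have already provided the answer. I just want to add a note about inline functions: even if you declare a function inline, the compiler does not necessarily inline it. Depending on the function's complexity, the compiler may treat it as a normal function.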
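To illustrate the point above: plain `inline` is only a request. Compilers offer stronger, compiler-specific forms, such as MSVC's `__forceinline` and the GCC/Clang `always_inline` attribute, but even these apply only to direct calls and can still be refused; MSVC also performs no inline expansion by default at /Od. A sketch with hypothetical names (`FORCE_INLINE`, `triple`, `triple_sum`, `sum_direct`, `sum_ptr`):

```c
/* FORCE_INLINE: a stronger, compiler-specific request than plain `inline`.
 * It still cannot help a call made through a function pointer whose
 * target the compiler cannot see. */
#if defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline
#endif

typedef struct { int x, y, z; } triple;   /* hypothetical type */

FORCE_INLINE int triple_sum(const triple *t) {
    return t->x + t->y + t->z;
}

/* Direct call: a candidate for inlining at this call site. */
int sum_direct(const triple *t) { return triple_sum(t); }

/* Taking the address forces an out-of-line copy of triple_sum;
 * calls made through this pointer are normally not inlined. */
int (*const sum_ptr)(const triple *) = triple_sum;
```

Comparing the assembly of `sum_direct` with a call through `sum_ptr` shows whether the request was honored.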

Answer 3

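As you said, the first case has an extra level of pointer indirection. Although sum is declared inline, it cannot easily be inlined, because it is called through a function pointer stored in an object member.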
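One way to recover a direct call while keeping the pointer for flexibility is a guard that compares the stored pointer against the expected implementation; profile-guided optimizers can apply a similar indirect-call promotion on their own. This is a swapped-in technique, not something from the question, and the names (`shape`, `shape_sum`, `shape_call_sum`, `shape_sum_doubled`) are hypothetical. It only pays off when one implementation dominates, since the comparison itself costs a load and a branch.

```c
/* Hypothetical "object" type with one method pointer, as in the question. */
typedef struct shape shape;
struct shape {
    int x, y, z;
    int (*sum)(const shape *self);
};

static inline int shape_sum(const shape *self) {
    return self->x + self->y + self->z;
}

/* Guarded call: if the pointer holds the common implementation, call that
 * function directly (a call the compiler is free to inline); otherwise
 * fall back to the genuine indirect call. */
static inline int shape_call_sum(const shape *s) {
    if (s->sum == shape_sum)
        return shape_sum(s);   /* direct, inlinable call */
    return s->sum(s);          /* indirect call */
}

/* An alternative implementation, to exercise the fallback path. */
static int shape_sum_doubled(const shape *self) {
    return 2 * (self->x + self->y + self->z);
}
```

Whether the direct branch is actually inlined still depends on the optimization level.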

I suggest comparing the generated assembly at each level from -O0 through -O3.
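A small translation unit makes that comparison easier to read than the full benchmark. The sketch below uses hypothetical names (`obj`, `obj_sum`, `call_direct`, `call_indirect`); the commands in the comment use standard GCC (`-S`) and MSVC (`/c`, `/FA`) options for emitting assembly without linking.

```c
/* Minimal translation unit for comparing the two call styles in assembly.
 * Example commands, repeated per optimization level:
 *   gcc -O0 -S calls.c        (writes calls.s; repeat with -O1, -O2, -O3)
 *   cl /c /Od /FA calls.c     (MSVC; writes calls.asm; repeat with /O2)
 */
typedef struct obj obj;
struct obj {
    int x, y, z;
    int (*sum)(const obj *self);
};

int obj_sum(const obj *self) {
    return self->x + self->y + self->z;
}

/* External linkage keeps both functions in the output at every level. */
int call_direct(const obj *o)   { return obj_sum(o); }
int call_indirect(const obj *o) { return o->sum(o); }
```

At higher optimization levels `call_direct` typically reduces to the additions themselves, while `call_indirect` still has to load the pointer and make an indirect call or jump.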

3 solutions

Solution 1
Score 8, accepted, 2018-02-06 16:15:00

Solution 2
Score 1, 2018-02-06 16:19:50

Solution 3
Score 0, 2018-02-06 16:15:20