简体   繁体   中英

Why is a dynamically linked executable noticeably slower than the statically linked one in Linux?

Testing with a toy program that determines the result of a tic-tac-toe board, I got this. What's making this quite big difference? I'd suspect that the calls to rand is faster with a statically linked libc, but still surprised with the result.

~$ gcc a.c -std=c11 -O3
~$ time ./a.out
32614644

real    0m9.396s
user    0m9.388s
sys     0m0.004s

~$ gcc a.c -std=c11 -O3 -static
~$ time ./a.out
32614644

real    0m6.891s
user    0m6.884s
sys     0m0.000s

#include <stdio.h>
#include <stdlib.h>

#define SIZE 3
#define SIZE_2 (SIZE * SIZE)

static int determineResult(int board[static SIZE_2]) {
  for (int i = 0; i < SIZE_2; i += SIZE) {
    if (!board[i]) {
      continue;
    }
    for (int j = i + 1; j < i + SIZE; ++j) {
      if (board[i] != board[j]) {
        goto next;
      }
    }
    return board[i];
  next:;
  }
  for (int i = 0; i < SIZE; ++i) {
    if (!board[i]) {
      continue;
    }
    for (int j = i + SIZE; j < i + SIZE_2; j += SIZE) {
      if (board[i] != board[j]) {
        goto next2;
      }
    }
    return board[i];
  next2:;
  }
  for (int i = SIZE + 1; i < SIZE_2; i += SIZE + 1) {
    if (board[i] != *board) {
      goto next3;
    }
  }
  return *board;
next3:
  for (int i = SIZE * 2 - 2; i <= SIZE_2 - SIZE; i += SIZE - 1) {
    if (board[i] != board[SIZE - 1]) {
      return 0;
    }
  }
  return board[SIZE - 1];
}

#define N 50000000

int main(void) {
  srand(0);
  size_t n = 0;
  for (int i = 0; i < N; ++i) {
    int board[SIZE_2];
    for (int i = 0; i < SIZE_2; ++i) {
      board[i] = rand() % 3;
    }
    n += determineResult(board);
  }
  printf("%zu\n", n);
  return EXIT_SUCCESS;
}

I can't be sure this is the reason without knowing the particular ABI (which depends on OS and cpu architecture) your system is using, but the following is the most likely explanation.

On most implementations, code in shared libraries (including shared libc.so ) has to be position-independent code . This means it can be loaded at any address (rather than assigned a fixed run-time address by the linker) and thus cannot use hard-coded absolute data addresses in the machine code. Instead, it has to access global data via either instruction-pointer-relative addressing or a global offset table (GOT) whose address is either kept in a register or computed relative to the instruction pointer. These addressing modes are efficient mainly on well-designed modern instruction set architectures like x86_64, AArch64, RISC-V, etc. On most other architectures, including 32-bit x86, they're quite inefficient. For example, the following function:

int x;
int get_x()
{
    return x;
}

will balloon into something like the following on x86:

get_x:
    push %ebp
    mov %esp, %ebp
    push %ebx
    sub $4, %esp
    call __x86.get_pc_thunk_bx
    add $_GLOBAL_OFFSET_TABLE_, %ebx
    mov x@GOT(%ebx), %eax
    mov (%eax),%eax
    add $4, %esp
    pop %ebx
    pop %ebp
    ret

whereas you would expect (for non-position-independent-code) to see:

get_x:
    mov x, %eax
    ret

Being that random number generators have internal (global) state, they're stuck doing this expensive dance for position-independent code. And being that the actual computation they do is likely very short and fast, the PIC overhead is probably a significant part of their run time.

One way to confirm this theory would be to try using rand_r or random_r instead. These functions use caller-provided state and thus can (at least in theory) avoid any internal access to global data.

The problem here is that you are comparing the total execution time, in a small example like this it will be greatly superior for the static linking example since there are no lookups to be done.

The big difference between static and dynamic linking is that dynamic linking has several modules/objects which are linked together at runtime and that statically compiled binaries have everything contained within the binary. There are some specifics that can vary of course, but that's roughly it.

So... taken the above into consideration, starting an executable, loading a few different files, executing your function and returning is probably going to take more time than load the file and execute your function.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM