C# calling native code is faster than native code calling native code

Question

While doing some performance testing, I've run into a situation that I cannot seem to explain.

I have written the following C code:

void multi_arr(int32_t *x, int32_t *y, int32_t *res, int32_t len)
{
    for (int32_t i = 0; i < len; ++i)
    {
        res[i] = x[i] * y[i];
    }
}

I use gcc to compile it, along with a test driver, into a single binary. I also use gcc to compile it by itself into a shared object, which I call from C# via P/Invoke. The intent is to measure the performance overhead of calling native code from C#.

In both C and C#, I create equal-length input arrays of random values and then measure how long it takes multi_arr to run. In both C# and C, I use the POSIX clock_gettime() call for timing. I have positioned the timing calls immediately before and after the call to multi_arr, so input preparation time etc. does not affect the results. I run 100 iterations and report both the average and the minimum times.

Even though C and C# are executing the exact same function, C# comes out ahead about 50% of the time, usually by a significant amount. For example, for a len of 1,048,576, C#'s min is 768,400 ns vs C's min of 1,344,105 ns. C#'s avg is 1,018,865 ns vs C's 1,852,880 ns. I put some different numbers into this graph (mind the log scales):

These results seem extremely wrong to me, but the artifact is consistent across multiple tests. I've checked the asm and IL to verify correctness. Bitness is the same. I have no idea what could be impacting performance to this degree. I've put a minimal reproduction example up here.

These tests were all run on Linux (KDE neon, based on Ubuntu Xenial) with dotnet-core 2.0.0 and gcc 5.0.4.

Has anyone seen this before?

Answer 1

It is dependent on alignment, as you already suspect. The allocator returns memory aligned so that storing or retrieving data types such as doubles or integers will not cause unnecessary faults, but it makes no promise as to how the block of memory fits into the cache(s).

How this varies depends on the hardware that you test on. Presuming you are talking about x86_64 here, that means the particular Intel or AMD processor and the speed of its caches relative to main memory access.

You can simulate this by testing with various alignments.

Here is an example program that I cobbled together. On my i7 I see large variations, but the first, most unaligned access is reliably slower than the more aligned versions.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void multi_arr(int32_t *x, int32_t *y, int32_t *res, int32_t len)
{
    for (int32_t i = 0; i < len; ++i)
    {
        res[i] = x[i] * y[i];
    }
}

// Returns the CPU time consumed by this process, in nanoseconds.
uint64_t getnsec(void)
{
  struct timespec n;

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &n);
  return (uint64_t) n.tv_sec * 1000000000 + n.tv_nsec;
}

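// Number of int32_t elements in 16 MiB. main() applies this value to byte
// addresses, so the masks built from it there span 4 MiB.
#define CACHE_SIZE (16 * 1024 * 1024 / sizeof(int32_t))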
int main()
{
  int32_t *memory;
  int32_t *unaligned;
  int32_t *x;
  int32_t *y;
  int count;
  uint64_t start, elapsed;
  int32_t len = 1024 * 16;
  int64_t aligned = 1;

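  // calloc takes the element count first, then the element size
  memory = calloc(4 * CACHE_SIZE, sizeof(int32_t));
  if (memory == NULL)
  {
    perror("calloc");
    return 1;
  }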

  // make unaligned as unaligned as possible, e.g. to 0b11111111111111100

  unaligned = (int32_t *) (((intptr_t) memory + CACHE_SIZE) & ~(CACHE_SIZE - 1));
  printf("memory starts at %p, aligned %p\n", memory, unaligned);
  unaligned = (int32_t *) ((intptr_t) unaligned | (CACHE_SIZE - 1));
  printf("memory starts at %p, unaligned %p\n", memory, unaligned);

  for (aligned = 1; aligned < CACHE_SIZE; aligned <<= 1)
  {
    x = (int32_t *) (((intptr_t) unaligned + CACHE_SIZE) & ~(aligned - 1));

    start = getnsec();
    for (count = 1; count < 1000; count++)
    {
      multi_arr(x, x + len, x + len + len, len);
    }
    elapsed = getnsec() - start;
    printf("memory starts at %p, aligned %08"PRIx64" to > cache = %p elapsed=%"PRIu64"\n", unaligned, aligned - 1, x, elapsed);
  }

  exit(0);
}

Answer 1 | 2 | accepted | 2017-10-03 08:57:08
