Why is this C++ code not faster?

I have two pieces of code. The first one is:

#include <iostream>

using namespace std;

static constexpr long long n = 1000000000;

int main() {
  int sum = 0;
  int* a = new int[n];
  int* b = new int[n];

  for (long long i=0; i<n; i++) {
    a[i] = static_cast<int>(i);
  }

  for (long long i=0; i<n; i++) {
    sum *= a[i];
    sum += a[i];
  }

  for (long long i=0; i<n; i++) {
    b[i] = static_cast<int>(i);
  }

  for (long long i=0; i<n; i++) {
    sum *= b[i];
    sum += b[i];
  }

  cout<<sum<<endl;
}

The second one is:

#include <iostream>

using namespace std;

constexpr long long n = 1000000000;

int main() {
  int* a = new int[n];
  int* b = new int[n];
  int sum = 0;

  for (long long i=0; i<n; i++) {
    a[i] = static_cast<int>(i);
    b[i] = static_cast<int>(i);
  }

  for (long long i=0; i<n; i++) {
    sum *= a[i];
    sum += a[i];
    sum *= b[i];
    sum += b[i];
  }

  cout<<sum<<endl;
}

I think the first program should be much faster than the second one, since it is more cache friendly. However, the second one is actually a little faster. On my server, the first one takes 23s while the second one takes 20s. Can someone explain this?

You're not seeing a cache-friendliness advantage because the access pattern is still far too simple, even in the version you predicted to be slower.

A modern CPU can detect two (or more) concurrent streams of sequential input and stream them into L1 before they are needed.

Having two streams can also allow multiple SDRAM banks to be put to useful work at the same time. If you're using Linux you don't get much control over that because pages are mapped randomly (I think; is this still true?), but you can try allocating memory using mmap() with the MAP_HUGETLB flag and then try different offsets from the start of the allocation.
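As a rough illustration of the mmap() suggestion, here is a minimal sketch for Linux. The helper names `map_buffer` and `unmap_buffer` are hypothetical, and the fallback to ordinary pages is an assumption added so the code still works on machines where no huge pages are reserved.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Hypothetical helper: map `bytes` of anonymous memory, preferring huge pages.
// `bytes` should be a multiple of the huge page size (often 2 MiB on x86-64
// Linux, but that is an assumption; check /proc/meminfo on your machine).
// Returns nullptr on failure and reports whether huge pages were obtained.
void* map_buffer(std::size_t bytes, bool* got_huge_pages) {
    const int prot = PROT_READ | PROT_WRITE;
    const int base_flags = MAP_PRIVATE | MAP_ANONYMOUS;
    *got_huge_pages = false;
#ifdef MAP_HUGETLB
    // This call fails if the system has no huge pages available.
    void* huge = mmap(nullptr, bytes, prot, base_flags | MAP_HUGETLB, -1, 0);
    if (huge != MAP_FAILED) {
        *got_huge_pages = true;
        return huge;
    }
#endif
    // Fallback (an addition for this sketch): ordinary pages.
    void* plain = mmap(nullptr, bytes, prot, base_flags, -1, 0);
    return plain == MAP_FAILED ? nullptr : plain;
}

// Release a mapping obtained from map_buffer.
void unmap_buffer(void* p, std::size_t bytes) {
    if (p != nullptr) {
        munmap(p, bytes);
    }
}
```

To experiment with offsets, one option is to map a single region large enough for both arrays plus some slack, place `a` at the start, and place `b` at a varying byte offset past the end of `a`.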

If you want to see the advantage of arranging your computations in a cache-friendly order, you should perhaps experiment with different access patterns in two-dimensional arrays.
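One way to try the two-dimensional experiment is sketched below. It sums the same matrix once along rows (consecutive addresses) and once down columns (a stride of `cols` elements). The function names and the flat row-major storage are assumptions made for this sketch, not something from the question.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Sum a rows x cols matrix stored row-major in a flat vector,
// walking along each row so that consecutive reads are adjacent in memory.
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long sum = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];
    return sum;
}

// Same sum, but walking down each column: every read jumps `cols` elements.
long long sum_column_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long sum = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];
    return sum;
}

// Hypothetical driver: time both traversals and report whether the sums agree.
bool compare_traversals(std::size_t rows, std::size_t cols) {
    using steady = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    std::vector<int> m(rows * cols, 1);
    auto t0 = steady::now();
    long long by_rows = sum_row_major(m, rows, cols);
    auto t1 = steady::now();
    long long by_cols = sum_column_major(m, rows, cols);
    auto t2 = steady::now();
    std::cout << "row-major:    " << ms(t1 - t0).count() << " ms\n"
              << "column-major: " << ms(t2 - t1).count() << " ms\n";
    return by_rows == by_cols;
}
```

Calling `compare_traversals(8192, 8192)` from `main` is one reasonable starting point; the matrix should be much larger than the last-level cache for the difference to show, and the exact timings depend on the machine and compiler flags.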

Caches don't play a big role in your example. Linear access to an array much bigger than the caches, with nearly no computation between the accesses, will always be limited by memory bandwidth, not by the caches. They simply don't have enough time to fill up by prefetching.

What you are testing is how clever your compiler is at merging your four/two loops into one, or at figuring out what you are doing and simply printing the result.

For the first piece of code, you use 4 loops to complete the task.

  for (long long i=0; i<n; i++) {
    a[i] = static_cast<int>(i);
  }

  for (long long i=0; i<n; i++) {
    sum *= a[i];
    sum += a[i];
  }

  for (long long i=0; i<n; i++) {
    b[i] = static_cast<int>(i);
  }

  for (long long i=0; i<n; i++) {
    sum *= b[i];
    sum += b[i];
  }

In the second one, you use only 2 loops to complete the task.

  for (long long i=0; i<n; i++) {
    a[i] = static_cast<int>(i);
    b[i] = static_cast<int>(i);
  }

  for (long long i=0; i<n; i++) {
    sum *= a[i];
    sum += a[i];
    sum *= b[i];
    sum += b[i];
  }

The second piece of code performs half as many loop iterations (2n instead of 4n), so it pays less loop overhead (increment, compare, branch) for the same number of array accesses.
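To measure the four-loop and two-loop variants side by side, a small harness along these lines may help. It is only a sketch: the function names are made up, `n` is a parameter so smaller sizes can be tried, and `unsigned` replaces `int` so that overflow wraps around instead of being undefined behaviour (a deliberate change from the question's code).

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>

// Four separate loops, as in the first program.
unsigned split_loops(std::size_t n) {
    std::unique_ptr<unsigned[]> a(new unsigned[n]);
    std::unique_ptr<unsigned[]> b(new unsigned[n]);
    unsigned sum = 0;
    for (std::size_t i = 0; i < n; ++i) a[i] = static_cast<unsigned>(i);
    for (std::size_t i = 0; i < n; ++i) { sum *= a[i]; sum += a[i]; }
    for (std::size_t i = 0; i < n; ++i) b[i] = static_cast<unsigned>(i);
    for (std::size_t i = 0; i < n; ++i) { sum *= b[i]; sum += b[i]; }
    return sum;
}

// Two fused loops, as in the second program.
// Note: the interleaved update order means this does not compute the same
// value as split_loops, so the two results should not be compared.
unsigned fused_loops(std::size_t n) {
    std::unique_ptr<unsigned[]> a(new unsigned[n]);
    std::unique_ptr<unsigned[]> b(new unsigned[n]);
    unsigned sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<unsigned>(i);
        b[i] = static_cast<unsigned>(i);
    }
    for (std::size_t i = 0; i < n; ++i) {
        sum *= a[i]; sum += a[i];
        sum *= b[i]; sum += b[i];
    }
    return sum;
}

// Hypothetical driver: time each variant once and print the results, which
// also keeps the compiler from discarding the computation entirely.
void report(std::size_t n) {
    using steady = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    auto t0 = steady::now();
    unsigned s1 = split_loops(n);
    auto t1 = steady::now();
    unsigned s2 = fused_loops(n);
    auto t2 = steady::now();
    std::cout << "split: " << ms(t1 - t0).count() << " ms (sum " << s1 << ")\n"
              << "fused: " << ms(t2 - t1).count() << " ms (sum " << s2 << ")\n";
}
```

Calling `report(100000000)` from `main` at several optimization levels (for example `-O0` and `-O2`) should show how much of the gap comes from loop overhead and how much the compiler removes on its own; a single run is noisy, so repeating the measurement is advisable.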
