
std::inner_product 4x faster than manual but no SIMD being used?

I was interested in how std::inner_product() performs compared with a manual dot-product calculation, so I ran a test.

std::inner_product() was 4x faster than the manual implementation. I find this odd, because surely there aren't that many ways to calculate a dot product?! I also cannot see any SSE/AVX registers being used at the point of calculation.

Setup: VS2013/MSVC(12?), Haswell i7 4770 CPU, 64-bit compilation, release mode.

Here is the C++ test code:

#include <iostream>
#include <functional>
#include <numeric>
#include <cstdint>
#include <intrin.h>   // MSVC header that declares __rdtsc and __rdtscp

int main() {
   const int arraySize = 1000;
   const int numTests = 500;
   unsigned int f = 0, s = 0;   // receive the TSC_AUX value written by __rdtscp
   unsigned long long* array1 = new unsigned long long[arraySize];
   unsigned long long* array2 = new unsigned long long[arraySize];

   //Initialise arrays
   for (int i = 0; i < arraySize; i++){
      unsigned long long val = __rdtsc();
      array1[i] = val;
      array2[i] = val;
   }

   //std::inner_product test
   unsigned long long timingBegin1 = __rdtscp(&s);
   for (int i = 0; i < numTests; i++){
      volatile unsigned long long result = std::inner_product(array1, array1 + arraySize, array2, static_cast<uint64_t>(0));
   }
   unsigned long long timingEnd1 = __rdtscp(&s);

   //Manual Dot Product test
   unsigned long long timingBegin2 = __rdtscp(&f);
   for (int i = 0; i < numTests; i++){
      volatile unsigned long long result = 0;

      for (int j = 0; j < arraySize; j++){
         result += (array1[j] * array2[j]);
      }
   }
   unsigned long long timingEnd2 = __rdtscp(&f);

   std::cout << "STL      :  " << static_cast<double>(timingEnd1 - timingBegin1) / numTests << " CPU cycles per dot product" << std::endl;
   std::cout << "Manually :  " << static_cast<double>(timingEnd2 - timingBegin2) / numTests << " CPU cycles per dot product" << std::endl;

   delete[] array1;
   delete[] array2;
   return 0;
}

Your test is bad, and this is likely to make a big difference.

   volatile uint64_t result = 0;
   for (int i = 0; i < arraySize; i++){
      result += (array1[i] * array2[i]);
   }

Note how you're continually using the volatile-qualified variable here. This forces the compiler to read the running total from memory and write it back on every iteration, instead of keeping it in a register.

In contrast, your inner_product version:

 volatile uint64_t result = std::inner_product(array1, array1 + arraySize, array2, static_cast<uint64_t>(0)); 

first calculates the inner product, allowing optimisations, and only then assigns the result to a volatile-qualified variable.
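To make the comparison fair, the manual loop can accumulate in an ordinary local variable and touch the volatile only once, which mirrors what the inner_product line does. The following is a minimal sketch of that idea, not the original poster's code: the names dot_volatile, dot_local and average_ns are hypothetical, and std::chrono::steady_clock is swapped in for __rdtscp so the example compiles outside MSVC.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: same shape as the question's manual loop.
// The running total is volatile, so every iteration must load it
// from memory and store it back.
std::uint64_t dot_volatile(const std::vector<std::uint64_t>& a,
                           const std::vector<std::uint64_t>& b) {
    assert(a.size() == b.size());
    volatile std::uint64_t result = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Written as "result = result + ..." because compound assignment
        // on a volatile object is deprecated in C++20.
        result = result + a[i] * b[i];
    }
    return result;
}

// Hypothetical helper: accumulate in a plain local, then perform a single
// volatile store so the computation cannot be discarded.
std::uint64_t dot_local(const std::vector<std::uint64_t>& a,
                        const std::vector<std::uint64_t>& b) {
    assert(a.size() == b.size());
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    volatile std::uint64_t sink = sum;  // the only volatile write
    return sink;
}

// Hypothetical timing helper: average wall-clock nanoseconds per call.
// This replaces the cycle counter used in the question (an assumption
// made for portability), so the units differ from the original output.
template <typename F>
double average_ns(F&& f, int repeats) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        f();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / repeats;
}
```

Calling average_ns([&]{ dot_local(a, b); }, 500) and average_ns([&]{ dot_volatile(a, b); }, 500) on the same two vectors, next to a call that stores std::inner_product(a.begin(), a.end(), b.begin(), std::uint64_t{0}) into a volatile, should show whether the gap reported in the question comes from the volatile accumulator. The actual numbers depend on compiler and flags, so none are quoted here.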
