Why is the same function definition more than 10x slower inside a class?

Question

I'm not sure which optimizations the compiler performs, but why is the same function body slower when defined as a member function of a class than when called as a global (free) function?

#include <iostream>
#include <chrono>
#include <cstdlib>   // system

#define MAX_BUFFER 256
const int whileLoops = 1024 * 1024 * 10;

void TracedFunction(int blockSize) {
    std::chrono::high_resolution_clock::time_point pStart;
    std::chrono::high_resolution_clock::time_point pEnd;

    double A[MAX_BUFFER];
    double B[MAX_BUFFER];
    double C[MAX_BUFFER];

    // fill A/B
    for (int sampleIndex = 0; sampleIndex < MAX_BUFFER; sampleIndex++) {
        A[sampleIndex] = sampleIndex;
        B[sampleIndex] = sampleIndex + 1000.0;
    }

    // same traced function
    pStart = std::chrono::high_resolution_clock::now();

    int whileCounter = 0;
    while (whileCounter < whileLoops) {
        for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
            double value = A[sampleIndex] + B[sampleIndex];

            C[sampleIndex] = value;
        }

        whileCounter++;
    }

    pEnd = std::chrono::high_resolution_clock::now();
    std::cout << "execution time: " << std::chrono::duration_cast<std::chrono::milliseconds>(pEnd - pStart).count() << " ms" << " | fake result: " << A[19] << " " << B[90] << " " << C[129] << std::endl;
}

class OptimizeProcess
{
public:
    std::chrono::high_resolution_clock::time_point pStart;
    std::chrono::high_resolution_clock::time_point pEnd;

    double A[MAX_BUFFER];
    double B[MAX_BUFFER];
    double C[MAX_BUFFER];

    OptimizeProcess() {
        // fill A/B
        for (int sampleIndex = 0; sampleIndex < MAX_BUFFER; sampleIndex++) {
            A[sampleIndex] = sampleIndex;
            B[sampleIndex] = sampleIndex + 1000.0;
        }
    }

    void TracedFunction(int blockSize) {
        // same traced function
        pStart = std::chrono::high_resolution_clock::now();

        int whileCounter = 0;
        while (whileCounter < whileLoops) {
            for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
                double value = A[sampleIndex] + B[sampleIndex];

                C[sampleIndex] = value;
            }

            whileCounter++;
        }

        pEnd = std::chrono::high_resolution_clock::now();
        std::cout << "execution time: " << std::chrono::duration_cast<std::chrono::milliseconds>(pEnd - pStart).count() << " ms" << " | fake result: " << A[19] << " " << B[90] << " " << C[129] << std::endl;
    }
};

int main() {
    int blockSize = MAX_BUFFER;

    // outside class
    TracedFunction(blockSize);

    // within class
    OptimizeProcess p1;
    p1.TracedFunction(blockSize);

    std::cout << std::endl;
    system("pause");

    return 0;
}

Tried with MSVC, /Oi /Ot.

~80 ms vs. ~1200 ms. Is the loop being unrolled because blockSize is treated as a compile-time constant?

I'm not sure, since I also tried setting blockSize to a random value with:

// requires #include <random>
std::mt19937_64 gen{ std::random_device()() };
std::uniform_real_distribution<double> dis{ 0.0, 1.0 };

int blockSize = dis(gen) * 255 + 1;

Same results...

Answer 1

If you compile with GCC's highest numbered optimization level, i.e. -O3, you will get similar execution times.

With respect to execution time, it makes no difference in itself whether a function is a member of a class or not.

The only difference I see is when and how you create your arrays. In the first (free) function, the arrays are automatic variables of the function. In the member function, the arrays are data members of the class.
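
To make that difference concrete, here is a minimal sketch that runs one shared kernel over both kinds of storage. The names AddArrays, FillInputs, RunWithAutomaticArrays, MemberArrays and kMaxBuffer are hypothetical and not from the question; the point is only that the loop body is identical and just the location of the arrays changes.

```cpp
#include <cassert>

constexpr int kMaxBuffer = 256;  // mirrors MAX_BUFFER from the question

// Shared kernel: the same code regardless of where the arrays live.
void AddArrays(const double* a, const double* b, double* c, int blockSize) {
    for (int i = 0; i < blockSize; i++) {
        c[i] = a[i] + b[i];
    }
}

// Fills the inputs the same way the question does.
void FillInputs(double* a, double* b) {
    for (int i = 0; i < kMaxBuffer; i++) {
        a[i] = i;
        b[i] = i + 1000.0;
    }
}

// Case 1: the arrays are automatic variables, created on every call.
double RunWithAutomaticArrays(int blockSize) {
    assert(blockSize >= 1 && blockSize <= kMaxBuffer);
    double A[kMaxBuffer];
    double B[kMaxBuffer];
    double C[kMaxBuffer];
    FillInputs(A, B);
    AddArrays(A, B, C, blockSize);
    return C[blockSize - 1];
}

// Case 2: the arrays are data members, reached through the object.
struct MemberArrays {
    double A[kMaxBuffer];
    double B[kMaxBuffer];
    double C[kMaxBuffer];

    MemberArrays() { FillInputs(A, B); }

    double Run(int blockSize) {
        assert(blockSize >= 1 && blockSize <= kMaxBuffer);
        AddArrays(A, B, C, blockSize);
        return C[blockSize - 1];
    }
};
```

Both cases compute the same values; whether they run at the same speed depends on how the compiler treats each kind of storage, which is what the timings in the question are probing.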

That can play a role in certain cases. Make the arrays global (so they are created only once), and you will see no difference in execution times, regardless of whether you use -O1, -O2 or -O3.
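
A sketch of the global-array variant might look like the following. The names gA, gB, gC, FillGlobals, TimeFreeFunction and GlobalArrayProcess are hypothetical, and the loop count is passed as a parameter (an assumption, not part of the question's code) so the experiment can be scaled down.

```cpp
#include <cassert>
#include <chrono>

constexpr int kMaxBuffer = 256;  // mirrors MAX_BUFFER from the question

// Arrays with static storage duration: created once, shared by both versions.
double gA[kMaxBuffer];
double gB[kMaxBuffer];
double gC[kMaxBuffer];

void FillGlobals() {
    for (int i = 0; i < kMaxBuffer; i++) {
        gA[i] = i;
        gB[i] = i + 1000.0;
    }
}

// Free-function version; returns the elapsed time in milliseconds.
long long TimeFreeFunction(int blockSize, int loops) {
    assert(blockSize >= 0 && blockSize <= kMaxBuffer);
    const auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < loops; n++) {
        for (int i = 0; i < blockSize; i++) {
            gC[i] = gA[i] + gB[i];
        }
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

// Member-function version operating on the same global arrays.
struct GlobalArrayProcess {
    long long TimeMemberFunction(int blockSize, int loops) {
        assert(blockSize >= 0 && blockSize <= kMaxBuffer);
        const auto start = std::chrono::steady_clock::now();
        for (int n = 0; n < loops; n++) {
            for (int i = 0; i < blockSize; i++) {
                gC[i] = gA[i] + gB[i];
            }
        }
        const auto end = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    }
};
```

To mirror the question, call FillGlobals() once, then run both versions with loops set to 1024 * 1024 * 10 and print an element of gC afterwards so the compiler cannot discard the loop. The actual timings will depend on the compiler and its flags.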

Note: compile with -O2, and you will get a faster execution time for the member function (the opposite of what you report). To be precise, a 1.35x speedup, as you can see in the Live Demo.

Nevertheless, remember that when optimization is done right, with -O3 in this case, you shouldn't see any significant difference whatsoever!

Solution 1 | 2 | Accepted | 2018-10-27 10:05:47
