
Why is the same function definition more than 10x slower when it is inside a class?

I'm not sure which optimizations the compiler applies, but why is the same function body slower when it is defined inside a class than when it is called as a global function?

#include <iostream>
#include <chrono>
#include <cstdlib> // for system()

#define MAX_BUFFER 256
const int whileLoops = 1024 * 1024 * 10;

void TracedFunction(int blockSize) {
    std::chrono::high_resolution_clock::time_point pStart;
    std::chrono::high_resolution_clock::time_point pEnd;

    double A[MAX_BUFFER];
    double B[MAX_BUFFER];
    double C[MAX_BUFFER];

    // fill A/B
    for (int sampleIndex = 0; sampleIndex < MAX_BUFFER; sampleIndex++) {
        A[sampleIndex] = sampleIndex;
        B[sampleIndex] = sampleIndex + 1000.0;
    }

    // same traced function
    pStart = std::chrono::high_resolution_clock::now();

    int whileCounter = 0;
    while (whileCounter < whileLoops) {
        for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
            double value = A[sampleIndex] + B[sampleIndex];

            C[sampleIndex] = value;
        }

        whileCounter++;
    }

    pEnd = std::chrono::high_resolution_clock::now();
    std::cout << "execution time: " << std::chrono::duration_cast<std::chrono::milliseconds>(pEnd - pStart).count() << " ms" << " | fake result: " << A[19] << " " << B[90] << " " << C[129] << std::endl;
}

class OptimizeProcess
{
public:
    std::chrono::high_resolution_clock::time_point pStart;
    std::chrono::high_resolution_clock::time_point pEnd;

    double A[MAX_BUFFER];
    double B[MAX_BUFFER];
    double C[MAX_BUFFER];

    OptimizeProcess() {
        // fill A/B
        for (int sampleIndex = 0; sampleIndex < MAX_BUFFER; sampleIndex++) {
            A[sampleIndex] = sampleIndex;
            B[sampleIndex] = sampleIndex + 1000.0;
        }
    }

    void TracedFunction(int blockSize) {
        // same traced function
        pStart = std::chrono::high_resolution_clock::now();

        int whileCounter = 0;
        while (whileCounter < whileLoops) {
            for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
                double value = A[sampleIndex] + B[sampleIndex];

                C[sampleIndex] = value;
            }

            whileCounter++;
        }

        pEnd = std::chrono::high_resolution_clock::now();
        std::cout << "execution time: " << std::chrono::duration_cast<std::chrono::milliseconds>(pEnd - pStart).count() << " ms" << " | fake result: " << A[19] << " " << B[90] << " " << C[129] << std::endl;
    }
};

int main() {
    int blockSize = MAX_BUFFER;

    // outside class
    TracedFunction(blockSize);

    // within class
    OptimizeProcess p1;
    p1.TracedFunction(blockSize);

    std::cout << std::endl;
    system("pause");

    return 0;
}

Compiled with MSVC using /Oi /Ot.

The result is ~80 ms vs 1200 ms. Is the loop being unrolled because blockSize is treated as a compile-time constant?

I'm not sure, since I also tried setting blockSize to a random value:

// requires #include <random>
std::mt19937_64 gen{ std::random_device()() };
std::uniform_real_distribution<double> dis{ 0.0, 1.0 };

int blockSize = dis(gen) * 255 + 1;

Same results...

If you compile with GCC's maximum optimization flag, i.e. O3, you will get similar execution times for both versions.

Whether a function is a member of a class or not makes no difference, in itself, to its execution time.


The only difference I see is when and how you create your arrays. In the first function, the arrays are automatic variables of the function. In the class version, the arrays are data members of the class.
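One way to check whether the storage of the arrays matters is to run the class version on automatic copies of its members. The following is only a sketch of such an experiment: the names LocalCopyProcess, TracedFunctionLocal, kBufferSize and the loops parameter are hypothetical and do not appear in the question, and whether this changes the timing depends on the compiler and flags.

```cpp
#include <algorithm>
#include <chrono>

constexpr int kBufferSize = 256; // hypothetical stand-in for MAX_BUFFER

// Hypothetical variant of the class from the question.
class LocalCopyProcess {
public:
    double A[kBufferSize];
    double B[kBufferSize];
    double C[kBufferSize];

    LocalCopyProcess() {
        // fill A/B exactly as in the question, and zero C
        for (int i = 0; i < kBufferSize; ++i) {
            A[i] = i;
            B[i] = i + 1000.0;
            C[i] = 0.0;
        }
    }

    // Same add loop, but executed on automatic (stack) arrays instead of
    // the data members. Returns the elapsed time in milliseconds.
    long long TracedFunctionLocal(int blockSize, int loops) {
        double a[kBufferSize];
        double b[kBufferSize];
        double c[kBufferSize] = {};
        std::copy(A, A + kBufferSize, a);
        std::copy(B, B + kBufferSize, b);

        const auto start = std::chrono::steady_clock::now();
        for (int n = 0; n < loops; ++n) {
            for (int i = 0; i < blockSize; ++i) {
                c[i] = a[i] + b[i];
            }
        }
        const auto end = std::chrono::steady_clock::now();

        // write the result back so the members hold the final values
        std::copy(c, c + kBufferSize, C);
        return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    }
};
```

Calling p.TracedFunctionLocal(blockSize, whileLoops) on an instance p and comparing the returned time with the member-array version from the question would show whether the gap comes from how the arrays are stored.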

That can play a role in certain cases. Make the arrays global (create them only once), and you will see no difference in execution time (regardless of whether you use O1, O2 or O3).
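The global-array suggestion could look roughly like the sketch below, where a free function and a member function run the same loop over the same global buffers. The names gA, gB, gC, FillGlobalInputs, FreeTracedFunction, GlobalBufferProcess and the loops parameter are assumptions made for illustration and are not taken from the question or from the Live Demo.

```cpp
#include <chrono>

constexpr int kMaxBuffer = 256; // hypothetical stand-in for MAX_BUFFER

// Global buffers, created once and shared by both variants.
double gA[kMaxBuffer];
double gB[kMaxBuffer];
double gC[kMaxBuffer];

// Fill the inputs as in the question and zero the output.
void FillGlobalInputs() {
    for (int i = 0; i < kMaxBuffer; ++i) {
        gA[i] = i;
        gB[i] = i + 1000.0;
        gC[i] = 0.0;
    }
}

// Free-function variant: returns the elapsed time in milliseconds.
long long FreeTracedFunction(int blockSize, int loops) {
    const auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < loops; ++n) {
        for (int i = 0; i < blockSize; ++i) {
            gC[i] = gA[i] + gB[i];
        }
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

// Member-function variant: identical loop over the same global buffers.
class GlobalBufferProcess {
public:
    long long TracedFunction(int blockSize, int loops) {
        const auto start = std::chrono::steady_clock::now();
        for (int n = 0; n < loops; ++n) {
            for (int i = 0; i < blockSize; ++i) {
                gC[i] = gA[i] + gB[i];
            }
        }
        const auto end = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    }
};
```

After calling FillGlobalInputs() once, both variants can be timed with the same blockSize and loop count. Because they touch the same storage, any remaining gap would have to come from something other than where the arrays live.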


Note: Compile with O2, and you will get a faster execution time for the class version (the opposite of what you report). To be precise, a 1.35x speedup, as you can see in the Live Demo.

Nevertheless, remember that when optimization is done right, with O3 in this case, you shouldn't see any significant difference whatsoever!
