"Bad" GCC optimization performance

Question

I am trying to understand why using -O2 -march=native with GCC produces slower code than compiling without them. Note that I am using MinGW (GCC 4.7.1) under Windows 7.

Here is my code:

struct.hpp:

#ifndef STRUCT_HPP
#define STRUCT_HPP

#include <iostream>

class Figure
{
public:
    Figure(char *pName);
    virtual ~Figure();

    char *GetName();
    double GetArea_mm2(int factor);

private:
    char name[64];
    virtual double GetAreaEx_mm2() = 0;
};

class Disk : public Figure
{
public:
    Disk(char *pName, double radius_mm);
    ~Disk();

private:
    double radius_mm;
    virtual double GetAreaEx_mm2();
};

class Square : public Figure
{
public:
    Square(char *pName, double side_mm);
    ~Square();  

private:
    double side_mm;
    virtual double GetAreaEx_mm2();
};

#endif

struct.cpp:

#include <cstdio>
#include "struct.hpp"

Figure::Figure(char *pName)
{
    sprintf(name, pName);
}

Figure::~Figure()
{
}

char *Figure::GetName()
{
    return name;
}

double Figure::GetArea_mm2(int factor)
{
    return (double)factor*GetAreaEx_mm2();
}

Disk::Disk(char *pName, double radius_mm_) :
Figure(pName), radius_mm(radius_mm_)
{
}

Disk::~Disk()
{
}

double Disk::GetAreaEx_mm2()
{
    return 3.1415926*radius_mm*radius_mm;
}

Square::Square(char *pName, double side_mm_) :
Figure(pName), side_mm(side_mm_)
{
}

Square::~Square()
{
}

double Square::GetAreaEx_mm2()
{
    return side_mm*side_mm;
}

main.cpp:

#include <iostream>
#include <cstdio>
#include "struct.hpp"

double Do(int n)
{
    double sum_mm2 = 0.0;
    const int figuresCount = 10000;
    Figure **pFigures = new Figure*[figuresCount];

    for (int i = 0; i < figuresCount; ++i)
    {
        if (i % 2)
            pFigures[i] = new Disk((char *)"-Disque", i);
        else
            pFigures[i] = new Square((char *)"-Carré", i);
    }

    for (int a = 0; a < n; ++a)
    {
        for (int i = 0; i < figuresCount; ++i)
        {
            sum_mm2 += pFigures[i]->GetArea_mm2(i);
            sum_mm2 += (double)(pFigures[i]->GetName()[0] - '-');
        }
    }

    for (int i = 0; i < figuresCount; ++i)
        delete pFigures[i];

    delete[] pFigures;

    return sum_mm2;
}

int main()
{
    double a = 0;

    StartChrono();      // home made lib, working fine
    a = Do(10000);
    double elapsedTime_ms = StopChrono();

    std::cout << "Elapsed time : " << elapsedTime_ms << " ms" << std::endl;

    return (int)a % 2;  // To force the optimizer to keep the Do() call
}
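
StartChrono() and StopChrono() come from a library that is not shown here. To reproduce the measurement, a minimal stand-in could look like the sketch below. It is an assumption, not the original library: it uses std::chrono::steady_clock and assumes StopChrono() returns the elapsed time in milliseconds, as the variable name elapsedTime_ms suggests.

```cpp
#include <chrono>

// Hypothetical stand-in for the home-made timing library used in main().
namespace
{
    // Start time recorded by the most recent StartChrono() call.
    std::chrono::steady_clock::time_point g_chronoStart;
}

// Record the current time as the start of the measured interval.
void StartChrono()
{
    g_chronoStart = std::chrono::steady_clock::now();
}

// Return the time elapsed since StartChrono(), in milliseconds.
double StopChrono()
{
    const std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - g_chronoStart).count();
}
```

steady_clock is used here rather than system_clock because it is monotonic, so the measured interval is not affected by wall-clock adjustments.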

I compiled this code twice:

1: Without optimization

mingw32-g++.exe -Wall -fexceptions -std=c++11 -c main.cpp -o main.o

mingw32-g++.exe -Wall -fexceptions -std=c++11 -c struct.cpp -o struct.o

mingw32-g++.exe -o program.exe main.o struct.o -s

2: With -O2 optimization

mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c main.cpp -o main.o

mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c struct.cpp -o struct.o

mingw32-g++.exe -o program.exe main.o struct.o -s

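1: Execution time:

1196 ms (1269 ms with Visual Studio 2013)

2: Execution time:

1569 ms (403 ms with Visual Studio 2013) !!!!!!!!!!!!!

Using -O3 instead of -O2 does not improve the results. I was, and still am, pretty convinced that GCC and Visual Studio are equivalent, so I don't understand this huge difference. I also don't understand why the optimized version is slower than the non-optimized version with GCC.

Am I missing something here? (Note that I have the same problem with genuine GCC 4.8.2 on Ubuntu.)

Thanks for your help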

Answer 1

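Since I haven't seen the assembly code, I'm going to speculate the following:

The allocation loop can be optimized (by the compiler) by removing the if clause, resulting in the following: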

 for (int i=0;i <10000 ; i+=2)
 {
       pFigures[i] = new Square(...);
 }
 for (int i=1;i <10000 ; i +=2)
 {
       pFigures[i] = new Disk(...);
 }

Considering that the end condition is a multiple of 4, it can be made even more "efficient":

 for (int i=0;i < 10000 ;i+=2*4)
 {
     pFigures[i] = ...
     pFigures[i+2] = ...
     pFigures[i+4] = ...
     pFigures[i+6] = ...
 }

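Memory-wise, this will allocate the Disks 4 by 4 and the Squares 4 by 4.

This means that figures of the same type will sit next to each other in memory.

Next, you iterate over the array 10000 times in normal order (by normal I mean index after index).

Think about where these shapes are allocated in memory. You will end up with 4 times more cache misses (take the borderline case where 4 disks and 4 squares are located in different pages: you will switch between the pages 8 times, whereas in the normal scenario you would switch between the pages only once).

This sort of optimization (if it is done by the compiler, and in your particular code) optimizes the allocation time but not the access time (which, in your example, is the biggest load).

Test this by removing the i % 2 and see what results you get.

Again, this is pure speculation, and it assumes that the reason for the lower performance is the loop optimization.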
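
One way to check the speculation above is to build the array twice, once with the original interleaved allocation and once with the split loops written by hand, and to time the same in-order traversal over both. The sketch below does that with simplified stand-ins for the question's classes; Build, Traverse and TimeTraverse_ms are hypothetical helper names, and the result depends on the allocator, so it only tests the hypothesis and does not confirm it.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <vector>

// Simplified stand-ins for the question's classes (hypothetical).
struct Figure
{
    virtual ~Figure() {}
    virtual double Area() const = 0;
};

struct Disk : Figure
{
    double r;
    explicit Disk(double r_) : r(r_) {}
    double Area() const override { return 3.1415926 * r * r; }
};

struct Square : Figure
{
    double s;
    explicit Square(double s_) : s(s_) {}
    double Area() const override { return s * s; }
};

// Fill the array. With `grouped` set, all Squares are allocated first and
// all Disks second, mimicking the split loops; otherwise the allocation
// order is interleaved as in the question.
std::vector<std::unique_ptr<Figure>> Build(int count, bool grouped)
{
    std::vector<std::unique_ptr<Figure>> v(count);
    if (grouped)
    {
        for (int i = 0; i < count; i += 2) v[i].reset(new Square(i));
        for (int i = 1; i < count; i += 2) v[i].reset(new Disk(i));
    }
    else
    {
        for (int i = 0; i < count; ++i)
        {
            if (i % 2) v[i].reset(new Disk(i));
            else       v[i].reset(new Square(i));
        }
    }
    return v;
}

// Traverse in index order `passes` times, as Do() does, and return the sum.
double Traverse(const std::vector<std::unique_ptr<Figure>>& v, int passes)
{
    double sum = 0.0;
    for (int a = 0; a < passes; ++a)
        for (std::size_t i = 0; i < v.size(); ++i)
            sum += v[i]->Area();
    return sum;
}

// Time one traversal run; returns milliseconds and stores the sum in *sumOut.
double TimeTraverse_ms(const std::vector<std::unique_ptr<Figure>>& v,
                       int passes, double* sumOut)
{
    assert(sumOut != nullptr);
    const auto t0 = std::chrono::steady_clock::now();
    *sumOut = Traverse(v, passes);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling TimeTraverse_ms(Build(10000, false), 10000, &sum) and TimeTraverse_ms(Build(10000, true), 10000, &sum) with and without -O2 would show whether allocation order alone changes the traversal time. Both layouts hold the same values at the same indices, so the two sums should be identical.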

Answer 2

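I suspect that you've hit an issue specific to the combination of mingw/gcc/glibc on Windows, because your code runs faster with optimization on Linux, where gcc is altogether more "at home".

On a fairly pedestrian Linux VM using gcc 4.8.2: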

$ g++ main.cpp struct.cpp
$ time a.out

real    0m2.981s
user    0m2.876s
sys     0m0.079s

$ g++ -O2 main.cpp struct.cpp
$ time a.out

real    0m1.629s
user    0m1.523s
sys     0m0.041s

...and if you really take the blinkers off the optimizer by removing struct.cpp and moving all of the implementation inline:

$ time a.out

real    0m0.550s
user    0m0.543s
sys     0m0.000s
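
The modified source for that last measurement is not shown. As a rough sketch of what "moving all of the implementation inline" could look like, the header below defines every member function in the class body so that the compiler sees the definitions while compiling main.cpp. Two details are my own assumptions and not part of the original code: the constructors take const char * and the name is copied with snprintf and an explicit "%s" format.

```cpp
#ifndef STRUCT_HPP
#define STRUCT_HPP

#include <cstdio>

class Figure
{
public:
    // Assumption: bounded copy with an explicit format string.
    explicit Figure(const char *pName)
    {
        std::snprintf(name, sizeof name, "%s", pName);
    }
    virtual ~Figure() {}

    // Defined in the class body, so both are implicitly inline.
    char *GetName() { return name; }
    double GetArea_mm2(int factor) { return (double)factor * GetAreaEx_mm2(); }

private:
    char name[64];
    virtual double GetAreaEx_mm2() = 0;
};

class Disk : public Figure
{
public:
    Disk(const char *pName, double radius_mm_)
        : Figure(pName), radius_mm(radius_mm_) {}

private:
    double radius_mm;
    double GetAreaEx_mm2() override { return 3.1415926 * radius_mm * radius_mm; }
};

class Square : public Figure
{
public:
    Square(const char *pName, double side_mm_)
        : Figure(pName), side_mm(side_mm_) {}

private:
    double side_mm;
    double GetAreaEx_mm2() override { return side_mm * side_mm; }
};

#endif
```

With this layout the calls to GetName() and GetArea_mm2() in Do() can be inlined, although the call to GetAreaEx_mm2() remains virtual. GCC's link-time optimization (-flto) is another way to let the optimizer see across translation units without changing the source; whether it gives the same speedup here is untested.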

2 answers

Answer 1
2 2015-08-28 13:52:22

Answer 2
1 2015-08-28 14:42:46