
Why is one thread faster than just calling a function, mingw

When I call the function directly, execution time is 6.8 sec. Calling it from one thread takes 3.4 sec, and with 2 threads 1.8 sec. No matter what optimization I use, the ratios stay the same.

In Visual Studio the times are as expected: 3.1, 3 and 1.7 sec.

#include<math.h>
#include<stdio.h>
#include<windows.h>
#include <time.h>

using namespace std;

#define N 400

float a[N][N];

struct b{
    int begin;   // first column (inclusive)
    int end;     // last column (exclusive)
};

DWORD WINAPI thread(LPVOID p)   // fills columns [begin, end) of a[][] for every row i
{
    b b_t = *(b*)p;

    for(int i=0;i<N;i++)
        for(int j=b_t.begin;j<b_t.end;j++)
        {
            a[i][j] = 0;
            for(int k=0;k<i;k++)
                a[i][j]+=k*sin(j)-j*cos(k);
        }

    return (0);
}

int main()
{
    clock_t t;
    HANDLE hn[2];

    b b_t[3];

    b_t[0].begin = 0;     // full range: used by steps 0 and 1
    b_t[0].end = N;

    b_t[1].begin = 0;     // first half: used by step 2
    b_t[1].end = N/2;

    b_t[2].begin = N/2;   // second half: used by step 2
    b_t[2].end = N;

    // 0: direct call on the main thread
    t = clock();
    thread(&b_t[0]);
    printf("0 - %d\n",clock()-t);

    // 1: the same work in a single created thread
    t = clock();
    hn[0] = CreateThread ( NULL, 0, thread,  &b_t[0], 0, NULL);
    WaitForSingleObject(hn[0], INFINITE );
    printf("1 - %d\n",clock()-t);

    // 2: the work split in half across two threads
    t = clock();
    hn[0] = CreateThread ( NULL, 0, thread,  &b_t[1], 0, NULL);
    hn[1] = CreateThread ( NULL, 0, thread,  &b_t[2], 0, NULL);
    WaitForMultipleObjects(2, hn, TRUE, INFINITE );
    printf("2 - %d\n",clock()-t);

    return 0;
}

Times:

0 - 6868
1 - 3362
2 - 1827

CPU - Core 2 Duo T9300

OS - Windows 8, 64-bit

compiler: mingw32-g++.exe, gcc version 4.6.2

edit:

Tried a different order, same result; even tried separate applications. Task Manager shows CPU utilization around 50% for the function call and for 1 thread, and 100% for 2 threads.

Sum of all elements after each call is the same: 3189909.237955
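
The identical sums can be checked with a small helper along these lines (a sketch, not part of the original code, reusing a and N from above):

double checksum(void)
{
    // add up every element of the global array a[N][N]
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

// e.g. printf("sum - %f\n", checksum()); after each timing block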

Cygwin results: 2.5, 2.5 and 2.5 sec. Linux results (pthread): 3.7, 3.7 and 2.1 sec.

@borisbn's results: 0 - 1446, 1 - 1439, 2 - 721.

The difference is a result of something in the math library's implementation of sin() and cos() - if you replace the calls to those functions with something else that takes time, the significant difference between step 0 and step 1 goes away.
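
One quick way to test that claim is to swap the libm calls for plain arithmetic that still does a comparable amount of floating-point work; a sketch of such a variant (the name thread_nolibm and the replacement expression are arbitrary, not from the answer):

DWORD WINAPI thread_nolibm(LPVOID p)
{
    b b_t = *(b*)p;

    for (int i = 0; i < N; i++)
        for (int j = b_t.begin; j < b_t.end; j++)
        {
            a[i][j] = 0;
            for (int k = 0; k < i; k++)
                a[i][j] += (float)k * j - (float)j / (k + 1);   // FP work without sin()/cos()
        }

    return 0;
}

With a variant like this, the gap between step 0 and step 1 should disappear, matching the observation above.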

Note that I see the difference with gcc (tdm-1) 4.6.1, which is a 32-bit toolchain targeting 32-bit binaries. Optimization makes no difference (not surprising, since it seems to be something in the math library).

However, if I build using gcc (tdm64-1) 4.6.1, which is a 64-bit toolchain, the difference does not appear - regardless of whether the build creates a 32-bit program (using the -m32 option) or a 64-bit program (-m64).

Here are some example test runs (I made minor modifications to the source to make it C99 compatible):

  • Using the 32-bit TDM MinGW 4.6.1 compiler:

     C:\temp>gcc --version
     gcc (tdm-1) 4.6.1

     C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c

     C:\temp>test
     0 - 4082
     1 - 2439
     2 - 1238

  • Using the 64-bit TDM 4.6.1 compiler:

     C:\temp>gcc --version
     gcc (tdm64-1) 4.6.1

     C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c

     C:\temp>test
     0 - 2506
     1 - 2476
     2 - 1254

     C:\temp>gcc -m64 -std=gnu99 -o test.exe test.c

     C:\temp>test
     0 - 3031
     1 - 3031
     2 - 1539

A little more information:

The 32-bit TDM distribution (gcc (tdm-1) 4.6.1) links to the sin() / cos() implementations in the msvcrt.dll system DLL via a provided import library:

c:/mingw32/bin/../lib/gcc/mingw32/4.6.1/../../../libmsvcrt.a(dcfls00599.o)
                0x004a113c                _imp__cos

While the 64-bit distribution (gcc (tdm64-1) 4.6.1) doesn't appear to do that, instead linking to a static library implementation provided with the distribution:

c:/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/4.6.1/../../../../x86_64-w64-mingw32/lib/../lib32/libmingwex.a(lib32_libmingwex_a-cos.o)
                              C:\Users\mikeb\AppData\Local\Temp\cc3pk20i.o (cos)

Update/Conclusion:

After a bit of spelunking in a debugger, stepping through the assembly of msvcrt.dll's implementation of cos(), I've found that the difference in the timing of the main thread versus an explicitly created thread is due to the FPU's precision being set to a non-default setting (presumably the MinGW runtime in question does this at start up). In the situation where the thread() function takes twice as long, the FPU is set to 64-bit precision (REAL10, or in MSVC-speak _PC_64). When the FPU control word is something other than 0x27f (the default state?), the msvcrt.dll runtime will perform the following steps in the sin() and cos() functions (and probably other floating point functions):

  • save the current FPU control word
  • set the FPU control word to 0x27f (I believe it's possible for this value to be modified)
  • perform the fsin / fcos operation
  • restore the saved FPU control word

The save/restore of the FPU control word is skipped if it's already set to the expected/desired 0x27f value. Apparently saving/restoring the FPU control word is expensive, since it appears to double the amount of time the function takes.
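
In C terms, the two paths look roughly like this (an illustrative paraphrase using _control87() from <float.h>, not the actual msvcrt.dll code; cos() here stands in for the raw fcos work):

double msvcrt_cos_sketch(double x)
{
    // Fast path: precision control is already in the expected 53-bit state
    // (the 0x27f control word), so no save/restore is needed.
    if ((_control87(0, 0) & _MCW_PC) == _PC_53)
        return cos(x);

    // Slow path: save the control word, force 53-bit precision,
    // compute, then restore the caller's precision bits.
    unsigned int saved = _control87(0, 0);
    _control87(_PC_53, _MCW_PC);
    double r = cos(x);
    _control87(saved, _MCW_PC);
    return r;
}

The save/restore work on the slow path is what appears to roughly double the cost of each sin()/cos() call when the main thread is left at 64-bit precision.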

You can solve the problem by adding the following line to main() before calling thread():

_control87( _PC_53, _MCW_PC);   // requires <float.h>
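
In context, a minimal sketch of the change (only the top of main() is shown; the rest of the question's code stays the same):

#include <float.h>   // _control87, _PC_53, _MCW_PC

int main()
{
    // Match the 53-bit x87 precision that CreateThread()'d threads get,
    // so the direct call in step 0 takes the same fast path in msvcrt.dll.
    _control87(_PC_53, _MCW_PC);

    /* ... the timing code from the question, unchanged ... */
    return 0;
}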

Not a cache matter here.

Likely different runtime libraries for user-created threads and the main thread. You may compare the calculation a[i][j]+=k*sin(j)-j*cos(k); in detail (the actual numbers) for specific values of i, j and k to confirm the differences.
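
A small probe along those lines (a sketch, not part of this answer; the fixed indices are arbitrary) prints one term from each context so the low-order bits can be compared:

DWORD WINAPI probe(LPVOID p)
{
    (void)p;
    int j = 123, k = 45;   // arbitrary fixed indices
    printf("thread: %.20f\n", k * sin((double)j) - j * cos((double)k));
    return 0;
}

// in main():
//   int j = 123, k = 45;
//   printf("main:   %.20f\n", k * sin((double)j) - j * cos((double)k));
//   WaitForSingleObject(CreateThread(NULL, 0, probe, NULL, 0, NULL), INFINITE);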

The reason is that the main thread is doing 64-bit float math while the created threads are doing 53-bit math.

You can confirm this / fix it by changing the code to:

...
extern "C" unsigned int _control87( unsigned int newv, unsigned int mask );

DWORD WINAPI thread(LPVOID p)
{
    printf( "_control87(): 0x%.4x\n", _control87( 0, 0 ) );
    _control87(0x00010000,0x00010000);
...

The output will be:

c:\temp>test   
_control87(): 0x8001f
0 - 2667
_control87(): 0x9001f
1 - 2683
_control87(): 0x9001f
_control87(): 0x9001f
2 - 1373

c:\temp>mingw32-c++ --version
mingw32-c++ (GCC) 4.6.2

You can see that step 0 was going to run without the 0x10000 flag, but once it is set, it runs at the same speed as steps 1 & 2. If you look up the _control87() function, you'll see that this value is the _PC_53 flag, which sets the precision to 53 bits instead of the 64 bits it would have used had the flag been left as zero.

For some reason, MinGW isn't setting it at process init time to the same value that CreateThread() does at thread creation time.

Another workaround is to turn on SSE2 with _set_SSE2_enable(1), which will run even faster, but may give different results.

c:\temp>test   
0 - 1341
1 - 1326
2 - 702

I believe this is on by default for the 64-bit build, because all 64-bit processors support SSE2.
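
For reference, the SSE2 workaround is a single call made before the timing runs. A sketch, assuming the msvcrt.dll runtime that exports _set_SSE2_enable() is being used; MinGW's headers may not declare it, so it is declared here the same way _control87() was declared above:

extern "C" int _set_SSE2_enable(int flag);   // exported by msvcrt.dll

int main()
{
    // Non-zero asks the CRT's sin()/cos() to use their SSE2 code paths;
    // the return value reports whether SSE2 was actually enabled.
    _set_SSE2_enable(1);

    /* ... the timing code from the question, unchanged ... */
    return 0;
}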

As others suggested, change the order of your three tests to get some more insight. Also, the fact that you have a multi-core machine explains pretty well why using two threads, each doing half the work, takes half the time. Take a look at your CPU usage monitor (Control-Shift-Escape) to find out how many cores are maxed out during the run.
