Why is iterating over a large array on the heap faster than iterating over a same-size array on the stack?

Question

I am allocating two same-size arrays, one on the stack and one on the heap, then iterating over them with a trivial assignment.

The executable is compiled to allocate 40 MB for the main thread's stack.

This code has only been tested to compile in VC++ with the /STACK:41943040 linker option.

#include "stdafx.h"
#include <string>
#include <iostream>
#include <malloc.h>
#include <windows.h>
#include <ctime>

using namespace std;

size_t stackavail()
{
    static unsigned StackPtr;   // top of stack ptr
    __asm mov [StackPtr],esp    // mov pointer to top of stack
    static MEMORY_BASIC_INFORMATION mbi;            // page range
    VirtualQuery((PVOID)StackPtr,&mbi,sizeof(mbi)); // get range
    return StackPtr-(unsigned)mbi.AllocationBase;   // subtract from top (stack grows downward on win)
}

int _tmain(int argc, _TCHAR* argv[])
{
    string input;

    cout << "Allocating 22mb on stack." << endl;
    unsigned int start = clock();
    char eathalfastack[23068672]; // approx 22mb
    auto length = sizeof(eathalfastack)/sizeof(char);
    cout << "Time taken in ms: " << clock()-start << endl;

    cout << "Setting through array." << endl;
    start = clock();
    for( int i = 0; i < length; i++ ){
        eathalfastack[i] = i;
    }
    cout << "Time taken in ms: " << clock()-start << endl;
    cout << "Free stack space: " << stackavail() << endl;


    cout << "Allocating 22mb on heap." << endl;
    start = clock();
    // auto* heaparr = new int[23068672]; // corrected
    auto* heaparr = new byte[23068672];
    cout << "Time taken in ms: " << clock()-start << endl;

    start = clock();
    cout << "Setting through array." << endl;
    for( int i = 0; i < length; i++ ){
        heaparr[i] = i;
    }
    cout << "Time taken in ms: " << clock()-start << endl;

    delete[] heaparr;
    getline(cin, input);
}

The output is this:

    Allocating 22mb on stack.
    Time taken in ms: 0
    Setting through array.
    Time taken in ms: 45
    Free stack space: 18872076
    Allocating 22mb on heap.
    Time taken in ms: 20
    Setting through array.
    Time taken in ms: 35

Why is iterating over the stack array slower than doing the same thing on the heap?

EDIT: nneonneo caught my error.

Now the output is identical:

    Allocating 22mb on stack.
    Time taken in ms: 0
    Setting through array.
    Time taken in ms: 42
    Free stack space: 18871952
    Allocating 22mb on heap.
    Time taken in ms: 4
    Setting through array.
    Time taken in ms: 41

Release build, per Öö Tiib's answer below:

    Allocating 22mb on stack.
    Time taken in ms: 0
    Setting through array.
    Time taken in ms: 5
    Free stack space: 18873508
    Allocating 22mb on heap.
    Time taken in ms: 0
    Setting through array.
    Time taken in ms: 10

Answer 1

Your arrays are not the same size: sizeof(char[23068672]) != sizeof(int[23068672]), and the elements are of different types.
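To make the size mismatch concrete, here is a minimal sketch that compares the byte sizes of the two array types from the original (uncorrected) code. The names kCount, array_bytes, check_question_sizes and sizes_differ are made up for illustration and do not appear in the question.

```cpp
#include <cassert>
#include <cstddef>

// Element count used for both arrays in the question.
constexpr std::size_t kCount = 23068672;

// Size in bytes of an array of kCount elements of type T.
// Only the type is inspected; no array is actually created.
template <typename T>
constexpr std::size_t array_bytes() {
    return sizeof(T[kCount]);
}

// sizeof(char) is 1 by definition, so the char array is exactly kCount bytes,
// while the int array is sizeof(int) times that.
void check_question_sizes() {
    assert(array_bytes<char>() == kCount);
    assert(array_bytes<int>() == kCount * sizeof(int));
}

// True when the two arrays occupy a different number of bytes,
// which is the case on any platform where sizeof(int) > 1.
bool sizes_differ() {
    return array_bytes<char>() != array_bytes<int>();
}
```

With a 4-byte int, the int array covers four times as much memory as the char array, even though both loops in the question run for the same number of iterations.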

Answer 2

Something is wrong with your PC; on my ages-old Pentium 4 it takes 15 ms to assign such a stack-based char array. Did you try with a debug build or something?
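For anyone repeating the measurement, the following is a rough sketch of a portable timing helper that uses std::chrono::steady_clock instead of clock(). The function names time_fill_us and time_heap_fill_us are hypothetical, and the volatile qualifier is only an assumption about how to stop an optimizing (release) build from discarding the stores. It is not a rigorous benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Writes the low byte of each index into the buffer and returns the elapsed
// time in microseconds. The pointer is volatile so that the stores are not
// optimized away when nothing reads the buffer afterwards.
long long time_fill_us(volatile unsigned char* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        data[i] = static_cast<unsigned char>(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Convenience wrapper (hypothetical): allocates `count` bytes on the heap
// through std::vector and times one fill pass over them.
long long time_heap_fill_us(std::size_t count) {
    std::vector<unsigned char> buffer(count);
    return time_fill_us(buffer.data(), count);
}
```

Calling time_heap_fill_us(23068672) in both a debug and a release build should show how much of the measured time comes from the build configuration rather than from where the memory lives. The same time_fill_us function can be pointed at a stack array if the stack has been enlarged as described in the question.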

Answer 3

There are two parts to your question:

Allocating space on the stack vs. on the heap
Accessing a memory location on the stack vs. a globally visible one

Allocating space

First, let's look at allocating space on the stack. The stack, as we know, grows downwards on the x86 architecture. So, in order to allocate space on the stack, all you have to do is decrement the stack pointer. That is just one assembly instruction (on x86, something like sub esp, amount). This instruction is always present in the prologue of a function (the function set-up code). So, as far as I know, allocating space on the stack should take essentially no time. Cost of allocating space on the stack = (decrement-sp operation). On a modern superscalar machine, the execution of this instruction will be overlapped with other instructions.

Allocating space on the heap, on the other hand, requires a library call to new/malloc. The library call first checks whether there is some space available on the heap. If yes, it will just return a pointer to the first available address. If space is not available on the heap, it will use a system call (brk on Unix-like systems) to request that the kernel modify the page-table entries for the additional pages. A system call is a costly operation. It will cause a pipeline flush, TLB pollution, etc. So, the cost of allocating space on the heap = (function call + computation for space + (brk system call)?). Allocating space on the heap therefore seems to be an order of magnitude slower than on the stack.

Accessing elements

The x86 ISA allows a memory operand to be addressed using direct addressing mode (temp = mem[addr]) to access a global variable, while variables on the stack are generally accessed using indexed addressing mode (temp = mem[stack-pointer + offset-on-stack]). My assumption is that both memory operands should take almost the same time; however, the direct addressing mode seems faster than the indexed addressing mode. Regarding the memory access of an array, we have two operands to access any element: the base address of the array and the index variable. When we are accessing an array on the stack, we add one more operand: the stack pointer. The x86 addressing mode has a provision for such addresses: base + scale*index + offset. So, stack array element access is temp = mem[sp + base-address + iterator*element-size], and heap array access is temp = mem[base-address + iterator*element-size]. On that reasoning, the stack access must be costlier than the heap access.
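As a small illustration of the element-access discussion, here is a sketch in which the same loop is run over a stack buffer and a heap buffer. The helper names fill and stack_and_heap_match and the 64 KB size are assumptions chosen so the example fits in a default-sized stack; they are not taken from the question. Inside fill the loop only sees a pointer, so whatever arithmetic is needed to form the base address happens once, outside the loop, and a compiler will typically keep that base in a register. This fits the identical timings reported in the question's EDIT, though it is not a proof about any particular compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>

// Writes the low byte of each index into the buffer. The loop body is the
// same machine code regardless of whether `data` points to stack or heap.
void fill(unsigned char* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        data[i] = static_cast<unsigned char>(i);
    }
}

// Fills a stack array and a heap array of equal size with the same loop and
// reports whether their contents end up identical.
bool stack_and_heap_match() {
    constexpr std::size_t kCount = 64 * 1024;  // small enough for a default stack
    unsigned char on_stack[kCount];
    std::unique_ptr<unsigned char[]> on_heap(new unsigned char[kCount]);

    fill(on_stack, kCount);        // base address derived from the stack pointer
    fill(on_heap.get(), kCount);   // base address returned by new

    return std::equal(on_stack, on_stack + kCount, on_heap.get());
}
```

Comparing the generated assembly for the two fill calls is one way to check whether the addressing mode really differs on a given compiler and optimization level.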

Now, coming to the generic case of iteration: if the iteration is slower for the stack, it means the addressing mode may be (I am not completely sure) the bottleneck, and if allocating the space is the bottleneck, the system call may be the cause.

Answer 1
9 2012-10-16 04:10:22

Answer 2
1 Accepted 2012-10-16 04:32:27

Answer 3
0 2012-10-16 04:54:47