
Can someone explain the performance behavior of the following memory allocating C program?

On my machine Time A and Time B swap depending on whether A is defined or not (which changes the order in which the two calloc calls are made).

I initially attributed this to the paging system. Weirdly, when mmap is used instead of calloc, the situation is even more bizarre -- both loops take the same amount of time, as expected. As can be seen with strace, the calloc calls ultimately result in two mmaps, so there is no return-already-allocated-memory magic going on.

I'm running Debian testing on an Intel i7.

#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

#include <time.h>

#define SIZE 500002816

#ifndef USE_MMAP
#define ALLOC calloc
#else
#define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE,  \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#endif

int main() {
  clock_t start, finish;
#ifdef A
  int *arr1 = ALLOC(sizeof(int), SIZE);
  int *arr2 = ALLOC(sizeof(int), SIZE);
#else
  int *arr2 = ALLOC(sizeof(int), SIZE);
  int *arr1 = ALLOC(sizeof(int), SIZE);
#endif
  int i;

  start = clock();
  {
    for (i = 0; i < SIZE; i++)
      arr1[i] = (i + 13) * 5;
  }
  finish = clock();

  printf("Time A: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

  start = clock();
  {
    for (i = 0; i < SIZE; i++)
      arr2[i] = (i + 13) * 5;
  }
  finish = clock();

  printf("Time B: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

  return 0;
}

The output I get:

 ~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop 
Time A: 0.94
Time B: 0.34
 ~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop                               
Time A: 0.34
Time B: 0.90
 ~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop                                          
Time A: 0.89
Time B: 0.90
 ~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop 
 ~/directory $ ./bench-loop                                      
Time A: 0.91
Time B: 0.92

You should also test using malloc instead of calloc. One thing that calloc does is fill the allocated memory with zeros.
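
For reference, calloc is roughly equivalent to malloc followed by an explicit zero fill. A minimal sketch (the helper name zalloc is made up, and the real calloc additionally checks that the size computation does not overflow):

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, roughly what calloc(n, size) does by hand. */
static void *zalloc(size_t n, size_t size) {
    void *p = malloc(n * size);   /* may hand back recycled, non-zero memory */
    if (p != NULL)
        memset(p, 0, n * size);   /* explicit zero fill touches every page   */
    return p;
}

int main(void) {
    int *arr = zalloc(500002816, sizeof(int));  /* same size as in the question */
    free(arr);
    return 0;
}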

I believe that in your case, when you calloc arr1 last and then assign to it, it has already been faulted into cache memory, since it was the last one allocated and zero-filled. When you calloc arr1 first and arr2 second, the zero-fill of arr2 pushes arr1 out of the cache.

Guess I could have written more, or less, especially as less is more.

The reason can differ from system to system. However, for the C library:

The total time used for each operation is the other way around if you time the calloc plus the iteration.

That is:

Calloc arr1 : 0.494992654
Calloc arr2 : 0.000021250
Itr arr1    : 0.430646035
Itr arr2    : 0.790992411
Sum arr1    : 0.925638689
Sum arr2    : 0.791013661

Calloc arr1 : 0.503130736
Calloc arr2 : 0.000025906
Itr arr1    : 0.427719162
Itr arr2    : 0.809686047
Sum arr1    : 0.930849898
Sum arr2    : 0.809711953
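
A harness along the following lines can produce such per-call numbers. This is only a sketch using POSIX clock_gettime, not the exact program used for the numbers above; older glibc versions may need -lrt at link time.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 500002816

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* wall-clock, nanosecond resolution */
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    double t0 = now();
    int *arr1 = calloc(sizeof(int), SIZE);
    double t1 = now();
    int *arr2 = calloc(sizeof(int), SIZE);
    double t2 = now();
    if (!arr1 || !arr2) return 1;

    for (long i = 0; i < SIZE; i++) arr1[i] = (i + 13) * 5;
    double t3 = now();
    for (long i = 0; i < SIZE; i++) arr2[i] = (i + 13) * 5;
    double t4 = now();

    printf("Calloc arr1 : %.9f\nCalloc arr2 : %.9f\n", t1 - t0, t2 - t1);
    printf("Itr arr1    : %.9f\nItr arr2    : %.9f\n", t3 - t2, t4 - t3);
    printf("Sum arr1    : %.9f\nSum arr2    : %.9f\n",
           (t1 - t0) + (t3 - t2), (t2 - t1) + (t4 - t3));
    return 0;
}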

The first calloc (and likewise the first malloc) has a longer execution time than the second. A call such as malloc(0) before any calloc etc. evens out the time used for malloc-like calls in the same process (explanation below). One can, however, see a slight decline in time for these calls if one does several in a row.

The iteration time, however, will flatten out.

So, in short: the total system time used is highest for whichever array gets allocated first. This is, however, an overhead that can't be escaped within the confines of a process.

There is a lot of maintenance going on. A quick look at some of the cases:


A short note on pages

When a process requests memory, it is served a virtual address range. This range is translated to physical memory by a page table. If addresses were translated byte by byte we would quickly get huge page tables. This is one reason why memory ranges are served in chunks, or pages. The page size is system dependent. The architecture can also provide various page sizes.
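
The page size in use can be queried at run time; a minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Query the page size the kernel uses for this process. */
    long page = sysconf(_SC_PAGESIZE);   /* typically 4096 bytes on x86/x86-64 Linux */
    printf("Page size: %ld bytes\n", page);
    return 0;
}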

If we look at the execution of the above code and add some reads from /proc/PID/stat, we see this in action (note RSS in particular):

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 214          Minor faults, (no page memory read)
  UTIME        : 0            Time user mode
  STIME        : 0            Time kernel mode
  VSIZE        : 2039808      Virtual memory size, bytes
  RSS          : 73           Resident Set Size, Number of pages in real memory
} : Init

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 51504        Minor faults, (no page memory read)
  UTIME        : 4            Time user mode
  STIME        : 33           Time kernel mode
  VSIZE        : 212135936    Virtual memory size, bytes
  RSS          : 51420        Resident Set Size, Number of pages in real memory
} : Post calloc arr1

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 51515        Minor faults, (no page memory read)
  UTIME        : 4            Time user mode
  STIME        : 33           Time kernel mode
  VSIZE        : 422092800    Virtual memory size, bytes
  RSS          : 51428        Resident Set Size, Number of pages in real memory
} : Post calloc arr2

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 51516        Minor faults, (no page memory read)
  UTIME        : 36           Time user mode
  STIME        : 33           Time kernel mode
  VSIZE        : 422092800    Virtual memory size, bytes
  RSS          : 51431        Resident Set Size, Number of pages in real memory
} : Post iteration arr1

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 102775       Minor faults, (no page memory read)
  UTIME        : 68           Time user mode
  STIME        : 58           Time kernel mode
  VSIZE        : 422092800    Virtual memory size, bytes
  RSS          : 102646       Resident Set Size, Number of pages in real memory
} : Post iteration arr2

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 102776       Minor faults, (no page memory read)
  UTIME        : 68           Time user mode
  STIME        : 69           Time kernel mode
  VSIZE        : 2179072      Virtual memory size, bytes
  RSS          : 171          Resident Set Size, Number of pages in real memory
} : Post free()

As we can see, the actual allocation of pages in memory is postponed for arr2, awaiting page requests, until iteration begins. If we add a malloc(0) before the calloc of arr1 we can register that neither array is allocated in physical memory before iteration.
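
A minimal sketch of how such snapshots can be taken from inside the program, assuming the field order documented in proc(5); the helper name stat_snapshot is made up, and RSS here is reported in pages:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse minor faults, vsize and RSS of the calling process from /proc/self/stat. */
static void stat_snapshot(const char *label) {
    char buf[1024];
    FILE *f = fopen("/proc/self/stat", "r");
    if (!f) return;
    if (fgets(buf, sizeof buf, f)) {
        char *p = strrchr(buf, ')');          /* skip past the comm field "(...)" */
        unsigned long minflt = 0, vsize = 0;
        long rss = 0;
        /* Fields 3..24: state ppid pgrp session tty tpgid flags minflt cminflt
           majflt cmajflt utime stime cutime cstime priority nice num_threads
           itrealvalue starttime vsize rss                                      */
        sscanf(p + 2,
               "%*c %*d %*d %*d %*d %*d %*u %lu %*u %*u %*u "
               "%*u %*u %*d %*d %*d %*d %*d %*d %*u %lu %ld",
               &minflt, &vsize, &rss);
        printf("%-16s MINFLT=%lu VSIZE=%lu RSS=%ld pages\n",
               label, minflt, vsize, rss);
    }
    fclose(f);
}

int main(void) {
    stat_snapshot("Init");
    int *arr = calloc(500002816, sizeof(int));
    if (!arr) return 1;
    stat_snapshot("Post calloc");
    for (long i = 0; i < 500002816; i++) arr[i] = 0;   /* touch every page */
    stat_snapshot("Post iteration");
    free(arr);
    stat_snapshot("Post free");
    return 0;
}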


As a page might never be used, it is more efficient to do the mapping on request. This is why, when the process does e.g. a calloc, a sufficient number of pages is reserved, but not necessarily actually allocated in real memory.

When an address is referenced, the page table is consulted. If the address is in a page which is not allocated, the system serves a page fault and the page is subsequently allocated. The total sum of allocated pages is called the Resident Set Size (RSS).

We can do an experiment with our array by iterating over (touching) e.g. 1/4 of it. Here I have also added a malloc(0) before any calloc.

Pre iteration 1/4:
RSS          : 171              Resident Set Size, Number of pages in real memory

for (i = 0; i < SIZE / 4; ++i)
    arr1[i] = 0;

Post iteration 1/4:
RSS          : 12967            Resident Set Size, Number of pages in real memory

Post iteration 1/1:
RSS          : 51134            Resident Set Size, Number of pages in real memory

To speed things up further, most systems additionally cache the N most recent page table entries in a translation lookaside buffer (TLB).


brk, mmap

When a process calls (c|m|…)alloc, the upper bound of the heap is expanded by brk() or sbrk(). These system calls are expensive, and to compensate for this, malloc collects multiple smaller calls into one bigger brk().

This also affects free(): since a negative brk() is also resource expensive, calls are collected and performed as one bigger operation.
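
A small sketch that makes this batching visible on glibc/Linux; sbrk(0) just reports the current program break, and the exact numbers will vary:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* Many small mallocs are typically served from one larger brk()
     * extension rather than one brk() per call. */
    void *before = sbrk(0);              /* current top of the heap */
    for (int i = 0; i < 1000; i++)
        (void)malloc(64);                /* small requests, served from the arena */
    void *after = sbrk(0);
    printf("break moved by %ld bytes for 1000 x 64-byte mallocs\n",
           (long)((char *)after - (char *)before));
    return 0;
}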


For huge requests, like the one in your code, malloc() uses mmap(). The threshold for this, which is configurable by mallopt(), is an educated value.
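
A sketch of adjusting that threshold with the glibc-specific mallopt knob M_MMAP_THRESHOLD; the 64 KiB value is arbitrary, and the effect can be verified with mallinfo or strace:

#include <malloc.h>
#include <stdlib.h>

int main(void) {
    /* glibc-specific knob: requests of 64 KiB and above now go straight
     * to mmap() instead of being carved out of the brk()-managed heap. */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);

    void *big   = malloc(1024 * 1024);   /* above the threshold -> mmap'd chunk */
    void *small = malloc(1024);          /* below the threshold -> heap chunk   */

    free(big);                           /* returned to the kernel via munmap() */
    free(small);                         /* stays in malloc's free lists        */
    return 0;
}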

We can have fun with this by modifying SIZE in your code. If we include malloc.h and use

struct mallinfo minf = mallinfo();

(no, not milf), we can show this (note Arena and Hblkhd, ...):

Initial:

mallinfo {
  Arena   :         0 (Bytes of memory allocated with sbrk by malloc)
  Ordblks :         1 (Number of chunks not in use)
  Hblks   :         0 (Number of chunks allocated with mmap)
  Hblkhd  :         0 (Bytes allocated with mmap)
  Uordblks:         0 (Memory occupied by chunks handed out by malloc)
  Fordblks:         0 (Memory occupied by free chunks)
  Keepcost:         0 (Size of the top-most releasable chunk)
} : Initial

MAX = ((128 * 1024) / sizeof(int)) 

mallinfo {
  Arena   :         0 (Bytes of memory allocated with sbrk by malloc)
  Ordblks :         1 (Number of chunks not in use)
  Hblks   :         1 (Number of chunks allocated with mmap)
  Hblkhd  :    135168 (Bytes allocated with mmap)
  Uordblks:         0 (Memory occupied by chunks handed out by malloc)
  Fordblks:         0 (Memory occupied by free chunks)
  Keepcost:         0 (Size of the top-most releasable chunk)
} : After malloc arr1

mallinfo {
  Arena   :         0 (Bytes of memory allocated with sbrk by malloc)
  Ordblks :         1 (Number of chunks not in use)
  Hblks   :         2 (Number of chunks allocated with mmap)
  Hblkhd  :    270336 (Bytes allocated with mmap)
  Uordblks:         0 (Memory occupied by chunks handed out by malloc)
  Fordblks:         0 (Memory occupied by free chunks)
  Keepcost:         0 (Size of the top-most releasable chunk)
} : After malloc arr2

Then we subtract sizeof(int) from MAX and get:

mallinfo {
  Arena   :    266240 (Bytes of memory allocated with sbrk by malloc)
  Ordblks :         1 (Number of chunks not in use)
  Hblks   :         0 (Number of chunks allocated with mmap)
  Hblkhd  :         0 (Bytes allocated with mmap)
  Uordblks:    131064 (Memory occupied by chunks handed out by malloc)
  Fordblks:    135176 (Memory occupied by free chunks)
  Keepcost:    135176 (Size of the top-most releasable chunk)
} : After malloc arr1

mallinfo {
  Arena   :    266240 (Bytes of memory allocated with sbrk by malloc)
  Ordblks :         1 (Number of chunks not in use)
  Hblks   :         0 (Number of chunks allocated with mmap)
  Hblkhd  :         0 (Bytes allocated with mmap)
  Uordblks:    262128 (Memory occupied by chunks handed out by malloc)
  Fordblks:      4112 (Memory occupied by free chunks)
  Keepcost:      4112 (Size of the top-most releasable chunk)
} : After malloc arr2

We see that the system works as advertised. If the allocation size is below the threshold, sbrk is used and the memory is handled internally by malloc; otherwise mmap is used.

This structure also helps prevent fragmentation of memory, etc.


The point being that the malloc family is optimized for general usage. However, mmap limits can be modified to meet special needs.

Note this (and the 100+ lines that follow) when/if modifying the mmap threshold.

This can be further observed if we fill (touch) every page of arr1 and arr2 before we do the timing:

Touch pages … (Here with page size of 4 kB)

for (i = 0; i < SIZE; i += 4096 / sizeof(int)) {
    arr1[i] = 0;
    arr2[i] = 0;
}

Itr arr1    : 0.312462317
CPU arr1    : 0.32

Itr arr2    : 0.312869158
CPU arr2    : 0.31



Side notes:

So, the CPU knows the physical address then? Nah.

In the world of memory a lot has to be addressed ;). A core piece of hardware for this is the memory management unit (MMU), either as an integrated part of the CPU or as an external chip.

The operating system configures the MMU at boot and defines access for the various regions (read only, read-write, etc.), thus giving a level of security.

The address we as mortals see is the logical address that the CPU uses. The MMU translates this to a physical address.

The CPU's address consists of two parts: a page address and an offset: [PAGE_ADDRESS.OFFSET]
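
A small sketch of that split, assuming the usual power-of-two page size reported by sysconf; this yields the virtual page number, while the physical frame is only known to the kernel/MMU:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

int main(void) {
    int x = 42;
    uintptr_t addr      = (uintptr_t)&x;
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);

    uintptr_t page   = addr / page_size;   /* virtual page number          */
    uintptr_t offset = addr % page_size;   /* byte offset inside that page */

    printf("addr=%#lx -> page %#lx, offset %#lx\n",
           (unsigned long)addr, (unsigned long)page, (unsigned long)offset);
    return 0;
}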

The process of getting a physical address can look something like this:

.-----.                          .--------------.
| CPU > --- Request page 2 ----> | MMU          |
+-----+                          | Pg 2 == Pg 4 |
      |                          +------v-------+
      +--Request offset 1 -+            |
                           |    (Logical page 2 EQ Physical page 4)
[ ... ]     __             |            |
[ OFFSET 0 ]  |            |            |
[ OFFSET 1 ]  |            |            |
[ OFFSET 2 ]  |            |            |     
[ OFFSET 3 ]  +--- Page 3  |            |
[ OFFSET 4 ]  |            |            |
[ OFFSET 5 ]  |            |            |
[ OFFSET 6 ]__| ___________|____________+
[ OFFSET 0 ]  |            |
[ OFFSET 1 ]  | ...........+
[ OFFSET 2 ]  |
[ OFFSET 3 ]  +--- Page 4
[ OFFSET 4 ]  |
[ OFFSET 5 ]  |
[ OFFSET 6 ]__|
[ ... ]

A CPU's logical address space is directly linked to the address length. A 32-bit address processor has a logical address space of 2^32 bytes. The physical address space is limited by how much memory the system has.

There is also the handling of fragmented memory, re-alignment, etc.

This brings us into the world of swap files. If a process requests more memory than is physically available, one or several pages of other process(es) are transferred to disk/swap and their pages are "stolen" by the requesting process. The MMU keeps track of this; thus the CPU doesn't have to worry about where the memory is actually located.


This further brings us to dirty memory.

If we print some information from /proc/[pid]/smaps, more specifically the range of our arrays, we get something like:

Start:
b76f3000-b76f5000
Private_Dirty:         8 kB

Post calloc arr1:
aaeb8000-b76f5000
Private_Dirty:        12 kB

Post calloc arr2:
9e67c000-b76f5000
Private_Dirty:        20 kB

Post iterate 1/4 arr1
9e67b000-b76f5000
Private_Dirty:     51280 kB

Post iterate arr1:
9e67a000-b76f5000
Private_Dirty:    205060 kB

Post iterate arr2:
9e679000-b76f5000
Private_Dirty:    410096 kB

Post free:
9e679000-9e67d000
Private_Dirty:        16 kB
b76f2000-b76f5000
Private_Dirty:        12 kB

When a virtual page is created, the system typically clears the dirty bit for that page.
When the CPU writes to a part of the page, the dirty bit is set; thus, when swapping, pages with the dirty bit set are written out while clean pages are skipped.


It's just a matter of when the process memory image expands by a page.

Short Answer

The first time calloc is called, it explicitly zeroes out the memory, while the next time it is called it assumes that the memory returned from mmap is already zeroed out.

Details

Here are some of the things I checked to come to this conclusion, which you could try yourself if you wanted:

  1. Insert a calloc call before your first ALLOC call. You will see that after this, Time A and Time B are the same.

  2. Use the clock() function to check how long each of the ALLOC calls takes. In the case where they both use calloc, you will see that the first call takes much longer than the second one.

  3. Use time to measure the execution time of the calloc version and the USE_MMAP version. When I did this I saw that the execution time for USE_MMAP was consistently slightly less.

  4. I ran with strace -tt -T, which shows both when each system call was made and how long it took. Here is part of the output:

Strace output:

21:29:06.127536 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff806fd000 <0.000014>
21:29:07.778442 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff093a0000 <0.000021>
21:29:07.778563 times({tms_utime=63, tms_stime=102, tms_cutime=0, tms_cstime=0}) = 4324241005 <0.000011>

You can see that the first mmap call took 0.000014 seconds, but that about 1.5 seconds elapsed before the next system call. Then the second mmap call took 0.000021 seconds, and was followed by the times call a few hundred microseconds later.

I also stepped through part of the application execution with gdb and saw that the first call to calloc resulted in numerous calls to memset, while the second call to calloc did not make any calls to memset. You can see the source code for calloc here (look for __libc_calloc) if you are interested. As for why calloc does the memset on the first call but not on subsequent ones, I don't know. But I feel fairly confident that this explains the behavior you have asked about.
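
The following is a conceptual sketch of that idea only, not glibc's __libc_calloc: memory that comes fresh from an anonymous mmap is already zeroed by the kernel, so the allocator can skip its own memset for such chunks. glibc tracks whether a chunk is freshly mmap'd; the simple size threshold below is a made-up stand-in for that bookkeeping.

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define SKETCH_MMAP_THRESHOLD (128 * 1024)   /* illustrative cut-off only */

static void *sketch_calloc(size_t n, size_t size) {
    size_t bytes = n * size;            /* real calloc also checks for overflow */

    if (bytes >= SKETCH_MMAP_THRESHOLD) {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;   /* kernel guarantees zeroed pages */
    }

    void *p = malloc(bytes);            /* may be recycled memory, old contents */
    if (p != NULL)
        memset(p, 0, bytes);            /* so it must be zeroed explicitly */
    return p;
}

int main(void) {
    int *small = sketch_calloc(1000, sizeof(int));       /* malloc path + memset */
    int *large = sketch_calloc(500002816, sizeof(int));  /* mmap path, no memset */
    free(small);
    if (large) munmap(large, 500002816 * sizeof(int));   /* mmap'd, so munmap    */
    return 0;
}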

As for why the array that was zeroed with memset has improved performance, my guess is that values are being loaded into the TLB rather than the cache, since it is a very large array. Regardless, the specific reason for the performance difference you asked about is that the two calloc calls behave differently when they are executed.

Summary: The time difference is explained by analysing the time it takes to allocate the arrays. The last calloc takes just a bit more time, whereas the others (or all of them when using mmap) take virtually no time. The real allocation in memory is probably deferred until first access.

I don't know enough about the internals of memory allocation on Linux, but I ran your program slightly modified: I added a third array and some extra iterations per array operation. I have also taken into account the remark from Old Pro that the time to allocate the arrays was not being measured.

Conclusion: Using calloc takes longer than using mmap for the allocation (mmap uses virtually no time when you allocate the memory; the work is probably postponed until first access), and with my program there is almost no difference in the end between using mmap or calloc for the overall program execution.

Anyway, a first remark: both memory allocations happen in the memory mapping region and not in the heap. To verify this, I added a quick 'n' dirty pause so you can check the memory mapping of the process (/proc/[pid]/maps).

Now to your question: the last array allocated with calloc seems to be really allocated in memory (not postponed), since arr1 and arr2 now behave exactly the same (the first iteration is slow, subsequent iterations are faster). arr3 is faster for the first iteration because the memory was allocated earlier. When using the A macro, it is arr1 which benefits from this. My guess would be that the kernel has preallocated the array in memory for the last calloc. Why? I don't know... I've also tested it with only one array (so I removed all occurrences of arr2 and arr3), and then I get (roughly) the same time for all 10 iterations of arr1.

Both malloc and mmap behave the same (results not shown below): the first iteration is slow and subsequent iterations are faster, for all three arrays.

Note: all results were consistent across the various gcc optimisation flags (-O0 to -O3), so it doesn't look like the root of the behaviour is derived from some kind of gcc optimisation.

Note 2: Test run on Ubuntu Precise Pangolin (kernel 3.2), with GCC 4.6.3.

#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

#include <time.h>

#define SIZE 500002816
#define ITERATION 10

#if defined(USE_MMAP)
#  define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE,  \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#elif defined(USE_MALLOC)
#  define ALLOC(a, b) (malloc(b * a))
#elif defined(USE_CALLOC)
#  define ALLOC calloc
#else
#  error "No alloc routine specified"
#endif

int main() {
  clock_t start, finish, gstart, gfinish;
  start = clock();
  gstart = start;
#ifdef A
  unsigned int *arr1 = ALLOC(sizeof(unsigned int), SIZE);
  unsigned int *arr2 = ALLOC(sizeof(unsigned int), SIZE);
  unsigned int *arr3 = ALLOC(sizeof(unsigned int), SIZE);
#else
  unsigned int *arr3 = ALLOC(sizeof(unsigned int), SIZE);
  unsigned int *arr2 = ALLOC(sizeof(unsigned int), SIZE);
  unsigned int *arr1 = ALLOC(sizeof(unsigned int), SIZE);
#endif
  finish = clock();
  unsigned int i, j;
  double intermed, finalres;

  intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
  printf("Time to create: %.2f\n", intermed);

  printf("arr1 addr: %p\narr2 addr: %p\narr3 addr: %p\n", arr1, arr2, arr3);

  finalres = 0;
  for (j = 0; j < ITERATION; j++)
  {
    start = clock();
    {
      for (i = 0; i < SIZE; i++)
        arr1[i] = (i + 13) * 5;
    }
    finish = clock();

    intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
    finalres += intermed;
    printf("Time A: %.2f\n", intermed);
  }

  printf("Time A (average): %.2f\n", finalres/ITERATION);


  finalres = 0;
  for (j = 0; j < ITERATION; j++)
  {
    start = clock();
    {
      for (i = 0; i < SIZE; i++)
        arr2[i] = (i + 13) * 5;
    }
    finish = clock();

    intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
    finalres += intermed;
    printf("Time B: %.2f\n", intermed);
  }

  printf("Time B (average): %.2f\n", finalres/ITERATION);


  finalres = 0;
  for (j = 0; j < ITERATION; j++)
  {
    start = clock();
    {
      for (i = 0; i < SIZE; i++)
        arr3[i] = (i + 13) * 5;
    }
    finish = clock();

    intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
    finalres += intermed;
    printf("Time C: %.2f\n", intermed);
  }

  printf("Time C (average): %.2f\n", finalres/ITERATION);

  gfinish = clock();

  intermed = ((double)(gfinish - gstart))/CLOCKS_PER_SEC;
  printf("Global Time: %.2f\n", intermed);

  return 0;
}

Results:

Using USE_CALLOC

Time to create: 0.13
arr1 addr: 0x7fabcb4a6000
arr2 addr: 0x7fabe917d000
arr3 addr: 0x7fac06e54000
Time A: 0.67
Time A: 0.48
...
Time A: 0.47
Time A (average): 0.48
Time B: 0.63
Time B: 0.47
...
Time B: 0.48
Time B (average): 0.48
Time C: 0.45
...
Time C: 0.46
Time C (average): 0.46

With USE_CALLOC and A

Time to create: 0.13
arr1 addr: 0x7fc2fa206010
arr2 addr: 0x7fc2dc52e010
arr3 addr: 0x7fc2be856010
Time A: 0.44
...
Time A: 0.43
Time A (average): 0.45
Time B: 0.65
Time B: 0.47
...
Time B: 0.46
Time B (average): 0.48
Time C: 0.65
Time C: 0.48
...
Time C: 0.45
Time C (average): 0.48

Using USE_MMAP

Time to create: 0.0
arr1 addr: 0x7fe6332b7000
arr2 addr: 0x7fe650f8e000
arr3 addr: 0x7fe66ec65000
Time A: 0.55
Time A: 0.48
...
Time A: 0.45
Time A (average): 0.49
Time B: 0.54
Time B: 0.46
...
Time B: 0.49
Time B (average): 0.50
Time C: 0.57
...
Time C: 0.40
Time C (average): 0.43
