
Why is the second loop over a static array in the BSS faster than the first?

I have the following code that writes zeros to a global array twice, once forward and once backward.

#include <string.h>
#include <time.h>
#include <stdio.h>
#define SIZE 100000000

char c[SIZE];
char c2[SIZE];

int main()
{
   int i;
   clock_t t = clock();
   for(i = 0; i < SIZE; i++)
       c[i] = 0;

   t = clock() - t;
   printf("%ld\n\n", (long)t);

   t = clock(); 
   for(i = SIZE - 1; i >= 0; i--)
      c[i] = 0;

   t = clock() - t;
   printf("%ld\n\n", (long)t);
}

I've run it a couple of times, and the second print always shows a smaller value...

However, if I change c to c2 in one of the loops, the time difference between the two prints becomes negligible... what is the reason for that difference?

EDIT:

I've tried compiling with -O3 and looked into the assembly: there were 2 calls to memset, but the second print still showed a smaller value.

When you define global data in C, it is zero-initialized:

char c[SIZE];
char c2[SIZE];

In the Linux (Unix) world this means that both c and c2 will be allocated in a special ELF file section, the .bss:

... data segment containing statically-allocated variables represented solely by zero-valued bits initially ...

The .bss segment exists so that the binary does not have to store all those zeroes; it just records something like "this program wants to have 200 MB of zeroed memory".
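
You can even see this from inside the program: on Linux, the default linker script exports symbols that mark the .bss boundaries. A minimal sketch (the __bss_start and _end symbols are a Linux linker convention, not portable C):

#include <stdio.h>

#define SIZE 100000000

char c[SIZE];
char c2[SIZE];

extern char __bss_start, _end;   /* provided by the default linker script on Linux */

int main()
{
   /* the region should span roughly both 100 MB arrays */
   printf(".bss is ~%ld bytes\n", (long)(&_end - &__bss_start));
}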

When your program is loaded, the ELF loader (the kernel in the case of classic static binaries, or the ld.so dynamic loader, also known as interp) will allocate the memory for .bss, usually with something like mmap with the MAP_ANONYMOUS flag and a READ+WRITE permissions/protection request.
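
As a rough illustration, what the loader does for .bss is conceptually close to the following mmap call (a minimal sketch, not the loader's actual code):

#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS */
#include <sys/mman.h>

#define SIZE 100000000

int main()
{
   /* ask the kernel for anonymous, zero-filled, read+write memory,
      roughly as the ELF loader does for the .bss section */
   char *bss = mmap(NULL, 2 * (size_t)SIZE,
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   if (bss == MAP_FAILED)
       return 1;

   /* no physical pages are committed yet: reads see the shared zero
      page, and the first write to each page takes a minor fault */
   return bss[12345];   /* reads 0 */
}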

But the memory manager in the OS kernel will not give you all 200 MB of zeroed memory. Instead it will mark part of your process's virtual memory as zero-initialized, and every page of this memory will point to a special zero page in physical memory. This page holds 4096 zero bytes, so if you read from c or c2, you will get zero bytes; this mechanism lets the kernel cut down memory requirements.

The mappings to the zero page are special; they are marked (in the page table) as read-only. When you first write to any such virtual page, a general protection fault or page-fault exception is generated by the hardware (by the MMU and TLB, I'd say). This fault is handled by the kernel, in your case by the minor page-fault handler. It allocates one physical page, fills it with zero bytes, and re-points the mapping of the just-accessed virtual page to this physical page. Then it reruns the faulted instruction.
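
Those minor faults can also be counted from inside the process: getrusage() reports them in the ru_minflt field (from <sys/resource.h>). A small sketch along the lines of the original program:

#include <stdio.h>
#include <sys/resource.h>

#define SIZE 100000000

char c[SIZE];

static long minflt(void)
{
   struct rusage ru;
   getrusage(RUSAGE_SELF, &ru);
   return ru.ru_minflt;          /* minor page faults so far */
}

int main()
{
   int i;
   long before = minflt();
   for(i = 0; i < SIZE; i++)     /* first pass: one fault per 4k page */
       c[i] = 0;
   printf("first pass:  %ld minor faults\n", minflt() - before);

   before = minflt();
   for(i = 0; i < SIZE; i++)     /* second pass: pages already mapped */
       c[i] = 0;
   printf("second pass: %ld minor faults\n", minflt() - before);
}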

I converted your code a bit (both loops are moved to separate functions):

$ cat b.c
#include <string.h>
#include <time.h>
#include <stdio.h>
#define SIZE 100000000

char c[SIZE];
char c2[SIZE];

void FIRST()
{
   int i;
   for(i = 0; i < SIZE; i++)
       c[i] = 0;
}

void SECOND()
{
   int i;
   for(i = 0; i < SIZE; i++)
       c[i] = 0;
}


int main()
{
   int i;
   clock_t t = clock();
   FIRST();
   t = clock() - t;
   printf("%ld\n\n", (long)t);

   t = clock(); 
   SECOND();

   t = clock() - t;
   printf("%ld\n\n", (long)t);
}

Compile with gcc b.c -fno-inline -O2 -o b, then run under Linux's perf stat or the more generic /usr/bin/time to get the page-fault count:

$ perf stat ./b
139599

93283


 Performance counter stats for './b':
 ....
            24 550 page-faults               #    0,100 M/sec           


$ /usr/bin/time ./b
234246

92754

Command exited with non-zero status 7
0.18user 0.15system 0:00.34elapsed 99%CPU (0avgtext+0avgdata 98136maxresident)k
0inputs+8outputs (0major+24576minor)pagefaults 0swaps

So, we have about 24.5 thousand minor page faults. With the standard x86/x86_64 page size of 4096 bytes, that is 24,576 pages × 4096 bytes ≈ 100 megabytes.

With the perf record / perf report Linux profiler we can find where the page faults occur (are generated):

$ perf record -e page-faults ./b
...skip some spam from non-root run of perf...
213322

97841

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (~801 samples) ]

$ perf report -n |cat
...
# Samples: 467  of event 'page-faults'
# Event count (approx.): 24583
#
# Overhead       Samples  Command      Shared Object                   Symbol
# ........  ............  .......  .................  .......................
#
    98.73%           459        b  b                  [.] FIRST              
     0.81%             1        b  libc-2.19.so       [.] __new_exitfn       
     0.35%             1        b  ld-2.19.so         [.] _dl_map_object_deps
     0.07%             1        b  ld-2.19.so         [.] brk                
     ....

So now we can see that only the FIRST function generates page faults (on the first write to the bss pages) and SECOND generates none. Every page fault corresponds to some work done by the OS kernel, and this work is done only once per page of bss (because the bss is not unmapped and remapped back).

Following asimes' answer that it's due to caching: I'm not convinced that you can enjoy the benefit of caches with a ~100 MB array; you're likely to completely thrash out any useful data before returning there.

However, depending on your platform (mostly the OS), there are other mechanisms at work: when you allocate the arrays you never initialize them, so the first loop probably incurs the penalty of the first access to each 4k page. This usually triggers a kernel assist that comes with high overhead.
In this case you also modify the page, so most systems would be forced to perform a copy-on-write flow (an optimization that works as long as you only read from a page), which is even heavier.
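
One way to avoid paying this penalty inside the timed region is to pre-fault all pages up front; on Linux, mmap's MAP_POPULATE flag does this when you allocate the buffer yourself instead of using a static array. A sketch (alloc_prefaulted is just an illustrative helper name; the warm-up loop in the snippet below achieves the same effect for the static arrays):

#define _GNU_SOURCE              /* for MAP_POPULATE */
#include <sys/mman.h>

#define SIZE 100000000

/* allocate a zeroed buffer with every page pre-faulted, so that timed
   loops over it never pay the first-touch penalty */
char *alloc_prefaulted(void)
{
   char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
   return (p == MAP_FAILED) ? NULL : p;
}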

Adding a small access per page (which should be negligible as far as actual caching goes, since it only fetches one 64B line out of each 4k page) managed to make the results more even on my system (although this form of measurement isn't very accurate to begin with):

#include <string.h>
#include <time.h>
#include <stdio.h>
#define SIZE 100000000

char c[SIZE];
char c2[SIZE];

int main()
{
   int i;
   for(i = 0; i < SIZE; i+=4096)      ////  access and modify each page once
       c[i] = 0;                      ////

   clock_t t = clock();

   for(i = 0; i < SIZE; i++)
       c[i] = 0;

   t = clock() - t;
   printf("%ld\n\n", (long)t);

   t = clock(); 
   for(i = SIZE - 1; i >= 0; i--)
      c[i] = 0;

   t = clock() - t;
   printf("%ld\n\n", (long)t);
}

If you modify the second loop to be identical to the first, the effect is the same: the second loop is faster.

#include <time.h>
#include <stdio.h>
#define SIZE 100000000

char c[SIZE];

int main() {
   int i;

   clock_t t = clock();
   for(i = 0; i < SIZE; i++)
       c[i] = 0;
   t = clock() - t;
   printf("%ld\n\n", (long)t);

   t = clock(); 
   for(i = 0; i < SIZE; i++)
      c[i] = 0;
   t = clock() - t;
   printf("%ld\n\n", (long)t);
}

This is because the first loop loads the data into the cache, and that data is readily accessible during the second loop.

Results of the above:

317841
277270

Edit: Leeor brings up a good point: c does not fit in the cache. I have an Intel Core i7 processor: http://ark.intel.com/products/37147/Intel-Core-i7-920-Processor-8M-Cache-2_66-GHz-4_80-GTs-Intel-QPI

According to the link, this means the L3 cache is only 8 MB (8,388,608 bytes), while c is 100,000,000 bytes, roughly 12 times the cache size.
