Why does the blocking technique reduce the number of branch instructions?

Question

I profiled blocked matrix multiplication and found that the number of branch instructions decreased as the block size increased. As shown in Image 1, the boxed group has about 4.5 million branch instructions, while the other groups have about 17 million, even though the groups differ only in the order of the loops. As far as I know, the branch-instruction count depends on the branch instructions (conditional or unconditional) that appear in the code, or rather in its machine code, so I can't figure out how loop reordering can change the amount of branching. Besides loop reordering, the blocking technique also affects the number of branch instructions.

The OS is Linux x86_64 with 4 GB of RAM. The L1 cache is 32 KB with a 64-byte line size; the L2 cache is 2048 KB with a 64-byte line size and is 4-way associative. I profiled with the PAPI library.

kij algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        r = A[i][k];                 /* A[i][k] is invariant in the j loop */
        for (j = 0; j < n; j++)
            C[i][j] += r * B[k][j];
    }

ikj algorithm

for (i = 0; i < n; i++)
    for (k = 0; k < n; k++) {
        r = A[i][k];                 /* A[i][k] is invariant in the j loop */
        for (j = 0; j < n; j++)
            C[i][j] += r * B[k][j];
    }

My blocking code is not at hand, but it uses one level of blocking.
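For readers who want something concrete, one level of blocking could look roughly like the sketch below. This is not the code that was profiled: the function name `matmul_blocked`, the block-size parameter `bs`, the helper `min_sz`, and the flat row-major layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smaller of two sizes, used to clamp block ends. */
static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Hypothetical one-level blocked multiply, C += A * B, for n x n matrices
 * stored row-major in flat arrays. The k and j ranges are split into
 * blocks of at most bs elements; inside a block the order is i, k, j. */
void matmul_blocked(size_t n, size_t bs,
                    const double *A, const double *B, double *C)
{
    assert(bs > 0);
    for (size_t kk = 0; kk < n; kk += bs) {
        size_t k_end = min_sz(kk + bs, n);
        for (size_t jj = 0; jj < n; jj += bs) {
            size_t j_end = min_sz(jj + bs, n);
            for (size_t i = 0; i < n; i++) {
                for (size_t k = kk; k < k_end; k++) {
                    double r = A[i * n + k];
                    for (size_t j = jj; j < j_end; j++)
                        C[i * n + j] += r * B[k * n + j];
                }
            }
        }
    }
}
```

Counted at source level, a version like this runs the same n³ innermost iterations as the unblocked loops and adds two outer block loops on top, so it contains more loop tests, not fewer.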

Image 1 (the chart uses a logarithmic scale, so the groups may all look alike, but the values are correct)


Question:

1. Why can loop reordering or blocking decrease or increase the number of branch instructions?

Thanks.

Answer 1

Loop reordering, which is one of the code-block reordering optimizations, alters the order of the basic blocks in a program in order to reduce conditional branches and improve locality of reference.

To describe branch reduction simply, suppose you have code like this:

void foo(bool is_enabled) {
  for (int i = 0; i < 10000; ++i) {
    if (is_enabled) {
      data[i].enable();
    } else {
      data[i].disable();
    }
  }
}

Given that there is no need to check is_enabled on every iteration, the compiler might decide to do this instead (a transformation commonly known as loop unswitching):

void foo(bool is_enabled) {
  if (is_enabled) {
    for (int i = 0; i < 10000; ++i) {
      data[i].enable();
    }
  } else {
    for (int i = 0; i < 10000; ++i) {
      data[i].disable();
    }
  }
}

... thus reducing the number of branches by 9999 (only one check of is_enabled instead of 10000).
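To make the 10000-versus-1 count concrete, here is a minimal sketch that counts how often the flag is tested in each form. The names `flags`, `set_flags_checked_in_loop` and `set_flags_unswitched` are hypothetical stand-ins for the `data` array and `foo` above, and the counter tracks source-level tests only, not the machine branches a real compiler would emit.

```c
#include <stddef.h>

#define N 10000

/* Hypothetical stand-in for the `data` array in the example above. */
static int flags[N];

/* Original form: is_enabled is tested once per iteration.
 * Returns the number of times the flag was tested. */
long set_flags_checked_in_loop(int is_enabled)
{
    long tests = 0;
    for (size_t i = 0; i < N; i++) {
        tests++;                    /* one test of is_enabled per element */
        if (is_enabled)
            flags[i] = 1;
        else
            flags[i] = 0;
    }
    return tests;
}

/* Unswitched form: is_enabled is tested once, outside the loops.
 * Each loop body then contains no test of the flag. */
long set_flags_unswitched(int is_enabled)
{
    long tests = 1;                 /* the single hoisted test */
    if (is_enabled) {
        for (size_t i = 0; i < N; i++)
            flags[i] = 1;
    } else {
        for (size_t i = 0; i < N; i++)
            flags[i] = 0;
    }
    return tests;
}
```

Both functions leave `flags` in the same state; only the number of flag tests differs. The loop-condition branch on `i` is still present in either form.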

In the code snippet you have, this is more of a locality-of-reference optimization: the more hardware-friendly memory access pattern plays nicely with the memory prefetcher and the CPU caches.

Answer 2

I don't think loop reordering would affect the number of branch instructions generated for your sample code, because it has no conditional test inside the loops and all the loops are the same length.
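As a rough check of that claim, the sketch below instruments both loop orders to count loop-condition evaluations at source level. The function names and the flat row-major arrays are assumptions for illustration, and a naive count like this says nothing about what an optimizing compiler actually emits.

```c
#include <stddef.h>

/* Each function computes C += A * B on n x n row-major matrices and returns
 * the number of loop-condition evaluations performed at source level.
 * The comma expression bumps the counter, then yields the real condition. */
long matmul_kij_count(size_t n, const double *A, const double *B, double *C)
{
    long tests = 0;
    for (size_t k = 0; (tests++, k < n); k++)
        for (size_t i = 0; (tests++, i < n); i++) {
            double r = A[i * n + k];
            for (size_t j = 0; (tests++, j < n); j++)
                C[i * n + j] += r * B[k * n + j];
        }
    return tests;
}

long matmul_ikj_count(size_t n, const double *A, const double *B, double *C)
{
    long tests = 0;
    for (size_t i = 0; (tests++, i < n); i++)
        for (size_t k = 0; (tests++, k < n); k++) {
            double r = A[i * n + k];
            for (size_t j = 0; (tests++, j < n); j++)
                C[i * n + j] += r * B[k * n + j];
        }
    return tests;
}
```

Both orders evaluate (n+1) + n(n+1) + n²(n+1) loop conditions, so swapping i and k does not change the count at this level.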

If the block size is known at compile time, your compiler might be unrolling the loop for each block.
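To show why unrolling matters for the branch count, here is a hand-unrolled version of the innermost row update next to the plain one, each returning how many loop conditions it evaluated. The unroll factor of 4 and the function names are assumptions for the sketch; whether your compiler unrolls, and by how much, can only be seen in its output.

```c
#include <stddef.h>

/* Plain inner loop: c[j] += r * b[j], one loop test per element plus one. */
long row_update_plain(size_t n, double r, const double *b, double *c)
{
    long tests = 0;
    for (size_t j = 0; (tests++, j < n); j++)
        c[j] += r * b[j];
    return tests;
}

/* Same update unrolled by 4: one loop test per four elements, followed by
 * a short remainder loop for the last n % 4 elements. */
long row_update_unrolled4(size_t n, double r, const double *b, double *c)
{
    long tests = 0;
    size_t j = 0;
    for (; (tests++, j + 3 < n); j += 4) {
        c[j]     += r * b[j];
        c[j + 1] += r * b[j + 1];
        c[j + 2] += r * b[j + 2];
        c[j + 3] += r * b[j + 3];
    }
    for (; (tests++, j < n); j++)
        c[j] += r * b[j];
    return tests;
}
```

For n = 16 the plain loop evaluates its condition 17 times, while the unrolled version evaluates 6 conditions in total (5 in the main loop, 1 in the remainder loop), with identical results in `c`.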

You should really look at the assembly output from your compiler.

Answer 1: 1
Answer 2: 0 (2013-10-14 15:20:37)
