
Manipulating MIPS assembly code to decrease cache miss rate (MARS simulator)

How could assembly code be optimised to decrease the cache miss rate? I am aware that changing the placement policy/block size/block replacement policy affects the miss rate, but I am specifically asking about the case where the MARS cache settings are held fixed.

We are currently asked to work with direct mapping, a block size of 2 words, and a cache size of 128 bytes, and to focus on the code itself to improve the miss rate.

My data segment currently looks like this:

.data                                           #data segment
    X: .word 1,1,1,2,2,3,4,2,1,1,2,3,2,4,8,1,1,1,1,2,2,3,4,2,1,1,2,3,2,4,8,1    #X[]
    H: .word 1,5,1,4,2,1,1,1,1,5,1,4,2,1,1,1,1,5,1,4,2,1,1,1                    #H[]
    Y: .word 0:8                                #Y[]

    m: .word 8                                  #m = 8
    n: .word 24                                 #n = 24

I noticed that moving the X[] array below n decreased the miss rate from 53% to 44%. Could I manipulate my code further to decrease the miss rate? Can this only be done by rearranging the data segment? Why does the hit rate improve when I move the X array a few lines down?

I would appreciate a "why this happens and how to take advantage of it" kind of explanation, but I am including my entire code for the sake of examples.

    .data                                           #data segment

    H: .word 1,5,1,4,2,1,1,1,1,5,1,4,2,1,1,1,1,5,1,4,2,1,1,1                    #H[]
    n: .word 24                                 #n = 24
    Y: .word 0:8                                #Y[]

    m: .word 8                                  #m = 8

    X: .word 1,1,1,2,2,3,4,2,1,1,2,3,2,4,8,1,1,1,1,2,2,3,4,2,1,1,2,3,2,4,8,1    #X[]

.text
fir:                                            #void fir
    la $t0, X                             #address of X[]  
    la $t1, H                              #address of H[]
    la $t2, Y                              #address of Y[]
    
    
    lw $t3, n
    lw $t4, m

    add $t5,$zero,$zero                          #set j counter to 0
    addi $s4,$zero,4                             #word size in bytes
                             

    j for1                                      #go to for1 loop

for1:   
    beq $t5,$t4,exit                              #when j equals m the loop stops and we go to exit
    add $t6,$zero,$zero                               #y0 = $t6 = 0
    add $t7,$zero,$zero                               #reset i counter to 0 for every iteration

    bne $t5,$t4,for2                              #go to the nested loop for2

for2:
    beq $t7,$t3, afterfor2                        # for(i=0, i<n, i++)

    #compute address of X[i+j]
    add $t8,$t5,$t7                               #i+j stored in $t8
    mul $t9,$t8,$s4                              #(i+j)*4 = byte offset, stored in $t9
    addu $t9,$t9,$t0                             #add base address of X
    lw $s0,($t9)                                 #load X[i+j] into $s0

    #compute address of H[i]
    mul $s1,$s4,$t7                             #4*i = byte offset
    addu $s1,$s1,$t1                            #add base address of H
    lw $s2,($s1)                                #load H[i] into $s2

    mul $s3,$s0,$s2                            #x[i+j] * h[i]
    add $t6,$t6,$s3                              #y0 = y0 + x[i+j]*h[i]
    addi $t7,$t7,1                               # i++
    j for2                                      # repeat loop

afterfor2: 
        
    mul $s5,$t5,$s4                             #4*j stored in $s5
    addu $s5,$s5,$t2                            #add base address of Y
    sw $t6,($s5)                                 #Y[j] = y0
    addi $t5,$t5,1                               #j++
    
    j for1                                      #go back to start of loop for1


exit:
    li $v0, 10
    syscall                                       #exit program

Are there any instructions (for example sw or la) that could be changed in order to improve the hit rate, or does it depend only on the data segment?

The hit rate depends on both data placement and the code's access pattern.

In a direct-mapped cache, there is only one place to cache each byte of memory, and that place is determined by the byte's memory address.

Specifically, an address breaks down into fields as follows:

    +--------+-------+--------+
    |  tag   | index | offset |  field name
    +--------+-------+--------+
        25       4       3       field size (bits)

With the stated settings there are 16 blocks/lines (128 bytes divided into 8-byte blocks), so the index is 4 bits (16 values). Each block/line holds 2 words, which is 8 bytes, so the offset is 3 bits (8 values).
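As a concrete illustration, here is a small Python sketch of that field split, parameterised for the settings stated in the question (2-word = 8-byte blocks, 128-byte cache, hence 16 lines; 0x10010000 is MARS's default start of the .data segment):

```python
def split_address(addr, offset_bits=3, index_bits=4):
    """Split a 32-bit address into (tag, index, offset) for a
    direct-mapped cache; defaults match a 128-byte cache with
    8-byte blocks (16 lines)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(split_address(0x10010000))  # -> (2097664, 0, 0)
print(split_address(0x10010080))  # -> (2097665, 0, 0): same index, new tag
```

Two addresses 128 bytes apart differ only in the tag, which is exactly the conflict situation described below.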

When two nearby memory locations have the same index and tag (that is, they fall in the same block), the cache performs well when they are alternately accessed.

When two distant memory locations have the same index but different tags, then when they are alternately accessed, the direct-mapped cache must evict one in order to cache the other.

If the program's access pattern alternates between two such distant locations, it will thrash the cache.

For example, let's focus on a single cache block, the first one at index 0:

Let's say that we access location 0x10010000. That will be cached in block/line 0 because the index field of that address is 0. Next we access location 0x10010080. This is a different address, but it also resolves to index 0. The direct-mapped cache has no choice but to evict address 0x10010000 from block 0, so that it can cache address 0x10010080 there. If the program then accesses 0x10010000 again, that is a miss, since it is no longer cached, and servicing it has the side effect of evicting 0x10010080 from block 0. Each time the program alternates between those two addresses, there is a miss.
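That thrashing pattern can be reproduced with a toy cache model. This is a hedged sketch, not MARS's actual simulator; the geometry follows the question's stated settings (8-byte blocks, 16 lines):

```python
def simulate(accesses, block=8, nlines=16):
    """Tiny direct-mapped cache model: one tag per line, no data,
    just hit/miss accounting for a sequence of byte addresses."""
    tags = [None] * nlines          # tag cached in each line; None = empty
    hits = misses = 0
    for addr in accesses:
        index = (addr // block) % nlines
        tag = addr // (block * nlines)
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1             # cold or conflict miss: evict and refill
            tags[index] = tag
    return hits, misses

# Alternating between the two addresses above misses every single time:
print(simulate([0x10010000, 0x10010080] * 4))  # -> (0, 8)
```

Repeated access to one address, by contrast, misses only once and then hits, which is the behaviour a better data layout tries to recover.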


Moving one array relative to the other changes which elements of the two arrays share the same index and thus collide in the cache.
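You can see which cache lines each array occupies with a short sketch. The base addresses are assumptions: they take MARS's default .data start of 0x10010000 and assume .word data is laid out back to back in declaration order, as in the original data segment:

```python
def lines_touched(base, size_bytes, block=8, nlines=16):
    """Set of direct-mapped cache line indices an array occupies,
    using the question's geometry (8-byte blocks, 16 lines)."""
    first = base // block
    last = (base + size_bytes - 1) // block
    return {b % nlines for b in range(first, last + 1)}

# Original layout: X (32 words = 128 B) first, H (24 words = 96 B) after it.
x_lines = lines_touched(0x10010000, 128)   # X covers all 16 lines
h_lines = lines_touched(0x10010080, 96)    # H covers lines 0..11
print(sorted(x_lines & h_lines))           # heavy overlap -> many conflicts
```

Shifting an array's base address rotates which of its elements land on which line, so even when the index sets still overlap, the *timing* of collisions between X[i+j] and H[i] changes, which is why a small reordering of the data segment moves the miss rate.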

Changing the size of the arrays would also have an effect: a 24-word array is 96 bytes, which is a good chunk of that 128-byte cache.

Changing the algorithm would change the access pattern and thus also have an effect. Sometimes we can restructure an algorithm to work on smaller parts of the array at a time, which can result in fewer conflicts.
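For instance, the inner i loop of this FIR computation can be tiled so that a small slice of H (and the matching window of X) is reused across all j values before moving on. A Python sketch of the restructuring, using the question's data; the tile size is an illustrative assumption:

```python
# FIR with the i loop split into tiles; each H tile (8 words = 32 bytes)
# is small enough to stay cached while it is reused for every j.
X = [1, 1, 1, 2, 2, 3, 4, 2, 1, 1, 2, 3, 2, 4, 8, 1] * 2   # 32 words
H = [1, 5, 1, 4, 2, 1, 1, 1] * 3                            # 24 words
m, n, TILE = 8, 24, 8
Y = [0] * m
for t0 in range(0, n, TILE):                # tile over the i dimension
    for j in range(m):                      # reuse the tile for every j
        for i in range(t0, min(t0 + TILE, n)):
            Y[j] += X[i + j] * H[i]
print(Y)
```

The arithmetic is identical to the original double loop; only the order of memory accesses changes, which is the point.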

Changing the cache design (adding associativity, for example) can dramatically reduce thrashing, at the cost of more hardware.
