How to optimize reversing the order of groups of bits
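Basically, I have 8 pieces of data, 2 bits each (4 states), stored in the 16 LSBs of a 32-bit integer. I want to reverse the order of the data pieces to do some pattern matching.
I am given a reference integer and 8 candidates, and I need to match one of the candidates to the reference. However, the matching candidate may be transformed in a predictable way.
If the reference data is in the form [0,1,2,3,4,5,6,7], then a possible match can be in any one of these 8 forms: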
[0,1,2,3,4,5,6,7], [0,7,6,5,4,3,2,1]
[6,7,0,1,2,3,4,5], [2,1,0,7,6,5,4,3]
[4,5,6,7,0,1,2,3], [4,3,2,1,0,7,6,5]
[2,3,4,5,6,7,0,1], [6,5,4,3,2,1,0,7]
The pattern is that the data is always in order, but it can be reversed and rotated.
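The eight forms listed above can be sketched as a small matching routine. This is only an illustration under stated assumptions: piece 0 is taken to live in bits 0..1 (LSB first, as in the comments of the C snippet in this question), and the names rotl16, rev_pieces and is_match are hypothetical helpers, not part of the original code. Under that layout, the four "forward" forms are rotations by 4 bits, and the four "mirrored" forms are a full reversal followed by one extra 2-bit rotation, since a plain reversal gives [7,6,5,4,3,2,1,0], which is not in the list.

```c
#include <stdint.h>

/* Rotate a 16-bit value left by n bits (0 < n < 16). */
static uint16_t rotl16(uint16_t v, unsigned n) {
    return (uint16_t)((v << n) | (v >> (16 - n)));
}

/* Reverse the order of the eight 2-bit pieces with a plain loop. */
static uint16_t rev_pieces(uint16_t v) {
    uint16_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint16_t)((r << 2) | (v & 3)); /* append lowest piece of v */
        v >>= 2;
    }
    return r;
}

/* Return 1 if cand is one of the 8 allowed forms of ref, else 0.
   Assumes piece 0 sits in bits 0..1 (an assumption, see lead-in). */
static int is_match(uint16_t ref, uint16_t cand) {
    uint16_t fwd = ref;                        /* [0,1,2,3,4,5,6,7] */
    uint16_t mir = rotl16(rev_pieces(ref), 2); /* [0,7,6,5,4,3,2,1] */
    for (int i = 0; i < 4; i++) {
        if (cand == fwd || cand == mir)
            return 1;
        fwd = rotl16(fwd, 4);                  /* rotate by two pieces */
        mir = rotl16(mir, 4);
    }
    return 0;
}
```

Comparing against all eight forms costs one reversal plus a handful of rotations per reference, so the reversal itself is the only part worth optimizing.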
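I am implementing this in C and MIPS. I have both working, but they seem bulky. My current approach is to mask out each piece from the original, shift it to its new position, and OR it into a new variable (initialized to 0).
The C version is more hard-coded: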
int ref = 4941; // reference value, original order [1,3,0,1,3,0,1,0], (encoded as 0b0001001101001101)
int rev = 0;
rev |= ((ref & 0x0003) << 14) | ((ref & 0x000C) << 10) | ((ref & 0x0030) << 6) | ((ref & 0x00C0) << 2); // move bottom 8 bits to top
rev |= ((ref & 0xC000) >> 14) | ((ref & 0x3000) >> 10) | ((ref & 0x0C00) >> 6) | ((ref & 0x0300) >> 2); // move top 8 bits to bottom
// rev = 29124 reversed order [0,1,0,3,1,0,3,1], (0b0111000111000100)
In MIPS I implemented a loop to try to reduce the static instruction count:
lw $01, Reference($00) # load reference value
addi $04, $00, 4 # initialize $04 as Loop counter
addi $05, $00, 14 # initialize $05 to hold shift value
addi $06, $00, 3 # initialize $06 to hold mask (one piece of data)
# Reverse the order of data in Reference and store it in $02
Loop: addi $04, $04, -1 # decrement Loop counter
and $03, $01, $06 # mask out one piece ($03 = Reference & $06)
sllv $03, $03, $05 # shift piece to new position ($03 <<= $05)
or $02, $02, $03 # put piece into $02 ($02 |= $03)
sllv $06, $06, $05 # shift mask for next piece
and $03, $01, $06 # mask out next piece ($03 = Reference & $06)
srlv $03, $03, $05 # shift piece to new position ($03 >>= $05)
or $02, $02, $03 # put new piece into $02 ($02 |= $03)
srlv $06, $06, $05 # shift mask back
addi $05, $05, -4 # decrease shift amount by 4
sll $06, $06, 2 # shift mask for next loop
bne $04, $00, Loop # keep looping while $04 != 0
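For readers who do not read MIPS, the loop above corresponds roughly to the following C. This is a sketch for comparison only; the function name reverse_loop is hypothetical, and the variables mirror the registers ($06 is mask, $05 is shift, $02 is rev).

```c
/* C rendering of the MIPS loop: each iteration swaps one low piece
   with its mirror-image high piece. Expects a 16-bit value in ref. */
static unsigned reverse_loop(unsigned ref) {
    unsigned rev = 0;   /* $02, result */
    unsigned mask = 3;  /* $06, selects one 2-bit piece */
    for (int shift = 14; shift > 0; shift -= 4) { /* $05: 14, 10, 6, 2 */
        rev |= (ref & mask) << shift; /* low piece moves up */
        mask <<= shift;               /* mask now selects the high piece */
        rev |= (ref & mask) >> shift; /* high piece moves down */
        mask >>= shift;               /* shift mask back */
        mask <<= 2;                   /* advance to the next low piece */
    }
    return rev;
}
```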
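Is there a way to implement this more simply, or at least with fewer instructions?
For a very simple and efficient approach, use a 256-byte lookup table and perform 2 lookups: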
extern unsigned char const xtable[256];
unsigned int ref = 4149;
unsigned int rev = (xtable[ref & 0xFF] << 8) | xtable[ref >> 8];
The xtable array can be statically initialized with a set of macros:
#define S(x) ((((x) & 0x03) << 6) | (((x) & 0x0C) << 2) | \
(((x) & 0x30) >> 2) | (((x) & 0xC0) >> 6)) /* reverse the 4 pairs within one byte */
#define X8(m,n) m((n)+0), m((n)+1), m((n)+2), m((n)+3), \
m((n)+4), m((n)+5), m((n)+6), m((n)+7)
#define X32(m,n) X8(m,(n)), X8(m,(n)+8), X8(m,(n)+16), X8(m,(n)+24)
unsigned char const xtable[256] = {
X32(S, 0), X32(S, 32), X32(S, 64), X32(S, 96),
X32(S, 128), X32(S, 160), X32(S, 192), X32(S, 224),
};
#undef S
#undef X8
#undef X32
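As an alternative to the macro initializer, the same 256-entry table could be filled at run time. The sketch below is an assumption-laden illustration: xtable_rt, init_xtable_rt and rev_lookup are hypothetical names, and each entry holds one byte with its four 2-bit pieces in reversed order, which is what the two-lookup expression relies on.

```c
#include <stdint.h>

static uint8_t xtable_rt[256]; /* hypothetical run-time table */

/* Fill the table: entry x holds byte x with its four pairs reversed. */
static void init_xtable_rt(void) {
    for (unsigned x = 0; x < 256; x++) {
        xtable_rt[x] = (uint8_t)(((x & 0x03) << 6) | ((x & 0x0C) << 2) |
                                 ((x & 0x30) >> 2) | ((x & 0xC0) >> 6));
    }
}

/* Reverse all eight pairs of a 16-bit value with two lookups:
   the reversed low byte becomes the high byte and vice versa. */
static uint16_t rev_lookup(uint16_t ref) {
    return (uint16_t)((xtable_rt[ref & 0xFF] << 8) | xtable_rt[ref >> 8]);
}
```

init_xtable_rt must be called once before the first rev_lookup.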
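If space is not at a premium, you can use a single lookup into a 128K-byte table, which can be computed at startup or generated by a script and included at compile time, but this is somewhat wasteful and not cache friendly.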
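A minimal sketch of the startup-computed variant, assuming a table of 65536 16-bit entries (65536 * 2 bytes = 128 KiB). The names rev_table and init_rev_table are hypothetical, and the entries are computed with the mask-and-shift expression from the question.

```c
#include <stdint.h>

static uint16_t rev_table[65536]; /* 128 KiB, one entry per 16-bit input */

/* Precompute the reversal of every 16-bit value once at startup. */
static void init_rev_table(void) {
    for (uint32_t x = 0; x < 65536u; x++) {
        rev_table[x] = (uint16_t)(
            ((x & 0x0003u) << 14) | ((x & 0x000Cu) << 10) |
            ((x & 0x0030u) << 6)  | ((x & 0x00C0u) << 2)  |
            ((x & 0xC000u) >> 14) | ((x & 0x3000u) >> 10) |
            ((x & 0x0C00u) >> 6)  | ((x & 0x0300u) >> 2));
    }
}
```

After initialization the reversal is a single load, rev_table[ref], at the cost of a table much larger than a typical L1 data cache.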
To reverse the order of the 2-bit groups, you can use the following code.
static int rev(int v){
// swap adjacent pairs of bits
v = ((v >> 2) & 0x3333) | ((v & 0x3333) << 2);
// swap nibbles
v = ((v >> 4) & 0x0f0f) | ((v & 0x0f0f) << 4);
// swap bytes
v = ((v >> 8) & 0x00ff) | ((v & 0x00ff) << 8);
return v;
}
The MIPS implementation takes 15 instructions.
rev: # value to reverse in $01
# uses $02 as a scratch register
# swap adjacent 2-bit pairs
srl $02, $01, 2
andi $02, $02, 0x3333
andi $01, $01, 0x3333
sll $01, $01, 2
or $01, $01, $02
# swap nibbles
srl $02, $01, 4
andi $02, $02, 0x0f0f
andi $01, $01, 0x0f0f
sll $01, $01, 4
or $01, $01, $02
# swap bytes
srl $02, $01, 8
andi $02, $02, 0xff
andi $01, $01, 0xff
sll $01, $01, 8
or $01, $01, $02
# result in $01
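Note that simply by doubling the constants you can reverse 2x16 bits at once (or even 4x16 bits on a 64-bit machine). But I am not sure whether that is useful in your case.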
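A minimal sketch of the doubled-constant idea, assuming two independent 16-bit values packed into one 32-bit word. The name rev2x16 is hypothetical. Each mask is simply repeated in both halves, so no bits cross the 16-bit boundary.

```c
#include <stdint.h>

/* Reverse the eight 2-bit groups in each 16-bit half of v independently. */
static uint32_t rev2x16(uint32_t v) {
    /* swap adjacent pairs of bits */
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);
    /* swap nibbles */
    v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);
    /* swap bytes within each 16-bit half */
    v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);
    return v;
}
```

This could, for example, reverse two candidates in one pass, although on MIPS the 32-bit masks no longer fit in an andi immediate and would have to be loaded into registers first.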
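Note: be careful with hand-optimized assembly. Such optimizations are processor-specific, so keep them for cases where you really do run into a compiler problem in a tight loop.
You can improve the pipelining (if you code in C, the compiler will do this for you) and use the delay slot of the bne instruction. This will improve your instruction-level parallelism.
Assume you have a MIPS processor with 1 delay slot and a 5-stage pipeline (instruction fetch, decode, execute, memory, writeback).
This pipeline introduces read-after-write (RAW) hazards on data dependencies, most of them on the $03 register.
A RAW hazard can cause your pipeline to stall.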
# Reverse the order of data in Reference and store it in $02
Loop: and $03, $01, $06 # mask out one piece ($03 = Reference & $06)
addi $04, $04, -1 # decrement Loop counter (placed here to separate the RAW pair on $03)
sllv $03, $03, $05 # shift piece to new position ($03 <<= $05)
sllv $06, $06, $05 # shift mask for next piece
or $02, $02, $03 # put piece into $02 ($02 |= $03)
and $03, $01, $06 # mask out next piece ($03 = Reference & $06)
srlv $06, $06, $05 # shift mask back
srlv $03, $03, $05 # shift piece to new position ($03 >>= $05)
addi $05, $05, -4 # decrease shift amount by 4
or $02, $02, $03 # put new piece into $02 ($02 |= $03)
bne $04, $00, Loop # keep looping while $04 != 0
sll $06, $06, 2 # shift mask for next loop
If you have a superscalar processor, the solution needs some changes.