
What is the fastest way to index into ARMv8 registers

The ARMv8 instruction set lets an instruction name any integer register directly in its encoding, as in:

add x0, x1, x2  // x0 = x1 + x2, 64 bit arithmetic

However, is there any way to select one of the registers x0 to x15 using a value held in another register?

For example, suppose register x16 contains the number 5. In that case, I want x5.

This can be accomplished in memory of course (an array), but that's much slower.

ldr x19, [x17, x16, lsl #3]

where x17 is some base address and x16 is the index, but this requires going to memory. Even if the data is cached, this is slower than a register access. When writing the value back, the write-through will presumably take more time.

The only other way I can think of doing this is some kind of computed goto:

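    add x18, x18, x16, lsl #6   // assumes x18 holds the address of label 1; lsl #6 means 64 bytes per case
    br  x18                     // AArch64 has br; bx exists only in AArch32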
1:
    mov x19, x0
    ...

2:
    mov x19, x1
    ...

3:
    mov x19, x2
    ...

And that would be even slower than the array access.
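The two alternatives described so far can be put side by side in a small C sketch. The function names `select_by_array` and `select_by_branch` are hypothetical and only four values are used to keep it short. How a compiler lowers the `switch` (jump table, conditional selects, or something else) is not guaranteed, so this illustrates the idea rather than any particular instruction sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Array approach: the values live in memory and the index is a plain
 * load offset, like ldr x19, [x17, x16, lsl #3]. */
static uint64_t select_by_array(const uint64_t *regs, unsigned index)
{
    return regs[index];
}

/* Branch approach: the values stay in separate variables (think registers)
 * and control flow picks one of them, like the computed goto. */
static uint64_t select_by_branch(unsigned index, uint64_t r0, uint64_t r1,
                                 uint64_t r2, uint64_t r3)
{
    assert(index < 4);
    switch (index) {
    case 0:  return r0;
    case 1:  return r1;
    case 2:  return r2;
    default: return r3;
    }
}
```

Both functions return the same value for the same index; they differ only in whether the selection is a memory access or a control-flow decision.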

Ideally there would be an indexing mode like:

mov x19, x[x16]

As noted in the comments, for small data sets it is often faster to simply use an array in memory. On ARM, the table lookup instructions (tbl and tbx) offer a way to do this somewhat more efficiently for larger amounts of data:

Up to four 16-byte SIMD registers can be passed to the tbl instruction as one lookup table. Each byte of the index register selects one byte of that table: if the index lies within the table, the result byte is the table byte at that position, otherwise it is zero (the similar instruction tbx leaves the destination byte unchanged instead). An example:

input:  v0 = [0x00, 0x01, 0x08, 0x10, 0x12, 0x20, 0x21, 0x30, 0x3F, 0x40, ...]
tables: v4 = [0x40, 0x41, 0x42, ..., 0x4F]
        v5 = [0x50, 0x51, 0x52, ..., 0x5F]
        v6 = [0x60, 0x61, 0x62, ..., 0x6F]
        v7 = [0x70, 0x71, 0x72, ..., 0x7F]

Executing tbl v1.16b, {v4.16b, v5.16b, v6.16b, v7.16b}, v0.16b gives the following:

output: v1 = [0x40, 0x41, 0x48, 0x50, 0x52, 0x60, 0x61, 0x70, 0x7F, 0x00, ...]

With tbx, index values greater than 0x3F are ignored instead of producing zero, so the destination byte keeps whatever it held before:

output: v1 = [0x40, 0x41, 0x48, 0x50, 0x52, 0x60, 0x61, 0x70, 0x7F, <old v1[9]>, ...]
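The behaviour of both instructions can be modelled in portable C. The following is a minimal sketch, not NEON code: `tbl_model` is a hypothetical helper that treats the table registers as one contiguous byte array, and its `keep_old` flag switches between the tbl behaviour (zero) and the tbx behaviour (destination unchanged).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable model of TBL/TBX with one to four 16-byte table registers.
 *   dst      16 destination bytes (also read when keep_old is nonzero)
 *   table    n_regs * 16 bytes, the table registers laid end to end
 *   idx      16 index bytes
 *   keep_old 0 models TBL (out of range -> 0),
 *            nonzero models TBX (out of range -> dst byte unchanged) */
static void tbl_model(uint8_t dst[16], const uint8_t *table, size_t n_regs,
                      const uint8_t idx[16], int keep_old)
{
    assert(n_regs >= 1 && n_regs <= 4);
    size_t table_len = n_regs * 16;
    uint8_t out[16];   /* temporary, so dst may alias idx */

    for (size_t i = 0; i < 16; i++) {
        if (idx[i] < table_len)
            out[i] = table[idx[i]];
        else
            out[i] = keep_old ? dst[i] : 0;
    }
    memcpy(dst, out, 16);
}
```

Filling a 64-byte table with 0x40 to 0x7F and passing the index bytes from the example reproduces the tbl output listed above. With `keep_old` set, the byte for index 0x40 keeps whatever the destination held before the call.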

How to use this to index into registers?

Since only a byte-wise lookup is possible, some preliminary work is necessary: the index from the general-purpose register is broadcast into a SIMD register and then turned into two sets of eight byte indices, one set for each of the two lookup tables.

input:                x0 = [index, 0, 0, ..., 0]
first  SIMD register: v0 = [index*8, index*8+1, ..., index*8+7, 0, 0, ..., 0]
second SIMD register: v1 = [index*8-64, index*8-63, ..., index*8-57, 0, 0, ..., 0]

This is needed because each byte index must lie between 0 and 15 (or 31, 47 or 63, depending on the number of table registers) and because the lookup has to fetch eight consecutive bytes here.

The index is therefore converted to a position in each lookup table (one table per tbl instruction). If a position is out of range for a table, tbl delivers zero, which has no effect when the two results are orr-ed together at the end.
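The index arithmetic can be checked with a small portable C model. This is a sketch under assumptions: `tbl8_model` and `select_entry` are hypothetical names, the 16 entries are stored as 128 contiguous bytes, and the first 64 bytes stand for the first table while the remaining 64 stand for the second.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 8-byte TBL lookup into a 64-byte table (four 16-byte registers).
 * Index bytes >= 64 yield zero, as TBL does. */
static uint64_t tbl8_model(const uint8_t table[64], const uint8_t idx[8])
{
    uint8_t out[8];
    for (int i = 0; i < 8; i++)
        out[i] = (idx[i] < 64) ? table[idx[i]] : 0;

    uint64_t v;
    memcpy(&v, out, 8);   /* same byte order the entries were stored in */
    return v;
}

/* Select 64-bit entry `index` (0..15) out of 16 entries stored as 128 bytes. */
static uint64_t select_entry(const uint8_t table[128], unsigned index)
{
    assert(index < 16);
    uint8_t lo_idx[8], hi_idx[8];

    for (unsigned i = 0; i < 8; i++) {
        /* index*8 + i: inside the first table only for index 0..7 */
        lo_idx[i] = (uint8_t)(index * 8 + i);
        /* index*8 + i - 64: wraps to 192..255 for index 0..7 (out of range),
         * lands in 0..63 for index 8..15 */
        hi_idx[i] = (uint8_t)(index * 8 + i - 64);
    }

    /* Exactly one lookup is in range, the other contributes zero. */
    return tbl8_model(table, lo_idx) | tbl8_model(table + 64, hi_idx);
}
```

The byte-wide wraparound is what makes the trick work: for a small index the subtraction of 64 produces a large unsigned byte, which the second lookup treats as out of range.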


Working example:

The following data needs to be defined:

modifier: .byte 0, 1, 2, 3, 4, 5, 6, 7, -64, -63, -62, -61, -60, -59, -58, -57

The input value is in x0. In this version, the values for the lookup are taken from the lookup_table memory location. The result is stored in x0:

// Load lookup table from memory
adr  x1, lookup_table
ldp  q8, q9, [x1]
ldp  q10, q11, [x1, 32]
ldp  q12, q13, [x1, 64]
ldp  q14, q15, [x1, 96]

// Take value to be looked up from general-purpose register
dup  v0.8b, w0

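// Prepare the byte indices before the lookup
adr  x1, modifier
ldp  d2, d3, [x1]            // d2 = 0..7, d3 = -64..-57
shl  v0.8b, v0.8b, #3        // every byte = index*8
add  v2.8b, v0.8b, v2.8b     // v2 = index*8 + 0..7        (positions in v8-v11)
add  v3.8b, v0.8b, v3.8b     // v3 = index*8 - 64 + 0..7   (positions in v12-v15)

// Do the lookups with the prepared indices; out-of-range bytes give zero
tbl  v2.8b, {v8.16b,  v9.16b,  v10.16b, v11.16b}, v2.8b
tbl  v3.8b, {v12.16b, v13.16b, v14.16b, v15.16b}, v3.8b
orr  v0.8b, v2.8b, v3.8b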

// Move the result back into a general-purpose register
umov x0, v0.d[0]

If there really is no other way, the values can also be taken from the general-purpose registers x8 to x23 instead of from memory:

ins   v8.d[0], x8
ins   v9.d[0], x10
ins  v10.d[0], x12
//   ...
ins  v15.d[0], x22
ins   v8.d[1], x9
ins   v9.d[1], x11
ins  v10.d[1], x13
//   ...
ins  v15.d[1], x23
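The resulting register layout can be written down as a short C model. The names `vreg`, `pack_gprs` and `locate` are hypothetical. `gpr[k]` stands for register x(8+k) and `v[j]` for register v(8+j), so consecutive general-purpose registers fill the low and high 64-bit lanes of consecutive SIMD registers.

```c
#include <assert.h>
#include <stdint.h>

/* Model of one 128-bit SIMD register as two 64-bit lanes. */
typedef struct { uint64_t d[2]; } vreg;

/* Model of the ins sequence: gpr[k] stands for x(8+k), v[j] for v(8+j). */
static void pack_gprs(vreg v[8], const uint64_t gpr[16])
{
    for (int k = 0; k < 16; k++)
        v[k / 2].d[k % 2] = gpr[k];
}

/* Where table entry `index` (0..15) ends up: SIMD register number and lane. */
static void locate(unsigned index, unsigned *vreg_no, unsigned *lane)
{
    assert(index < 16);
    *vreg_no = 8 + index / 2;
    *lane = index % 2;
}
```

For example, entry 5 corresponds to x13 and lands in lane 1 of v10, matching the `ins` line for x13 above.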
