64-bit copy from a uint32_t[16] array to a sequence of uint32_t variables

Question

I have been able to use a 64-bit copy on equal-sized uint32_t arrays for a performance gain, and I want to do the same from a uint32_t[16] array to a sequence of 16 uint32_t variables. I cannot replace the variables with an array because that causes a performance regression.

I noticed that the compiler gives a series of declared uint32_t variables consecutive addresses in reverse order: the last variable gets the lowest address, and the addresses increase by 4 bytes up to the first declared variable. I tried taking the address of that final variable and casting it to a uint64_t * pointer, but this did not work. The elements of the uint32_t[16] array, however, are laid out in ascending order.

Here is an example of my most recent attempt.

uint32_t x00,x01,x02,x03,x04,x05,x06,x07,x08,x09,x10,x11,x12,x13,x14,x15;
uint64_t *Bu64ptr = (uint64_t *) B;
uint64_t *x15u64ptr = (uint64_t *) &x15;

/* This is an inline function that does 64-bit eqxor on two uint32_t[16] 
& stores the results in uint32_t B[16]*/
salsa8eqxorload64(B,Bx);

/* Trying to 64-bit copy here */
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;

Am I pursuing the impossible, or is my lack of skill getting in the way again? I checked the address of x15 and the value of x15u64ptr using the method below, and they are completely different.

printf("x15u64ptr %p\n", (void *) x15u64ptr);
printf("x15 %p\n", (void *) &x15);
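One possible explanation for the differing addresses, assuming the printf calls ran after the eight copy statements: every `*x15u64ptr++` moves the pointer forward by 8 bytes, so after eight of them it points 64 bytes past where it started. The sketch below illustrates only that arithmetic. The function name is hypothetical, and a real uint64_t array backs the pointer so that every access is valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 8, "sketch assumes 8-bit bytes");

/* Returns how many bytes a uint64_t pointer has moved after eight
   post-incremented stores, mirroring the eight copy lines above. */
static ptrdiff_t bytes_advanced_after_eight_stores(void)
{
    uint64_t src[8] = {0};
    uint64_t dst[8];
    uint64_t *s = src;
    uint64_t *d = dst;
    uint64_t *start = d;            /* remember where the destination began */

    for (int i = 0; i < 8; i++)
        *d++ = *s++;                /* each store also advances d by 8 bytes */

    return (char *)d - (char *)start;
}
```

Printing the pointer before the copies, or printing a saved copy of its initial value, would show whether it really started at &x15.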

I had one idea: create an array and use the x?? variables as pointers to the individual elements of the array, then perform the 64-bit copy between the two arrays, which I hoped would assign the values to the uint32_t variables. However, compilation failed with a message about an invalid lvalue for the = assignment. Maybe I am doing something wrong in the syntax. Using 64-bit memcpy alternatives and a custom 64-bit eqxor, I have increased the performance of the hashing function by over 10%, and I expect this to give another 5-10% improvement if I can get it to work.

UPDATE 13-09-2018

I ended up using a struct and then a NEON-based operation. This gives 20% better performance than the original, which used 32-bit code and memcpy. I was also able to extend the technique to the add-and-save and eqxor operations that salsa20/8 uses.

struct XX
{
uint32_t x00, x01, x02, x03, x04, x05, x06, x07, x08, x09, x10, x11, x12,x13,x14,x15;
} X;

//dst & src must be uint32_t[32]. Note only 8 operations, to account for "128-bit" though neon really only does 64-bit at a time.
static inline void memcpy128neon(uint32_t * __restrict dst, uint32_t * __restrict src)
{
uint32x4_t *s1 = (uint32x4_t *) dst;
uint32x4_t *s2 = (uint32x4_t *) src;

*s1++ = *s2++;*s1++ = *s2++;*s1++ = *s2++;*s1++ = *s2++;*s1++ = *s2++;*s1++ = *s2++;*s1++ = *s2++;*s1++ = *s2++;
}

Then invoke it like this: memcpy128neon(&X.x00, arr);
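For comparison, here is a minimal portable sketch of the same struct idea without NEON intrinsics. The names `salsa_words` and `load_words` are hypothetical. It relies on two facts: struct members are laid out in declaration order, and a fixed-size memcpy is something optimizing compilers commonly lower to wide loads and stores. Whether it matches the hand-written NEON version would need measuring. Note also that eight 128-bit copies move 32 words, so the destination passed to memcpy128neon has to be at least that large; this sketch copies exactly the 16 words of the struct.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sixteen 32-bit words kept as named members (hypothetical struct name). */
struct salsa_words {
    uint32_t x00, x01, x02, x03, x04, x05, x06, x07,
             x08, x09, x10, x11, x12, x13, x14, x15;
};

/* Members of identical type are not expected to be padded; verify it. */
static_assert(sizeof(struct salsa_words) == 16 * sizeof(uint32_t),
              "struct must be exactly 16 words");

/* Copy 16 words from an array into the named members in one call. */
static inline void load_words(struct salsa_words *dst, const uint32_t src[16])
{
    memcpy(dst, src, sizeof *dst);
}
```

After `load_words(&X, arr)`, `X.x00` holds `arr[0]` and `X.x15` holds `arr[15]`.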

Update 16-10-2018: I found this macro, which allows union casting:

#define UNION_CAST(x, destType) \
   (((union {__typeof__(x) a; destType b;})x).b)

Here is an example of creating a 1024-bit pointer using a custom type based on Arm's NEON uint32x4_t vector (an array of 8 such vectors), but any data type can be used. This is intended to make the cast compliant with strict aliasing.

uint32x4x8_t *pointer = (uint32x4x8_t *) UNION_CAST(originalpointer, uint32x4x8_t *);
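To show what the macro does on a plain value, here is a small sketch that reads the bit pattern of a float. The helper name `float_bits` is hypothetical, the cast to a union type is a GNU C extension (accepted by GCC and Clang in C mode, not ISO C), and the example value assumes IEEE 754 single precision.

```c
#include <assert.h>
#include <stdint.h>

#define UNION_CAST(x, destType) \
   (((union {__typeof__(x) a; destType b;})x).b)

static_assert(sizeof(float) == sizeof(uint32_t),
              "sketch assumes a 32-bit float");

/* Reinterpret the bits of a float as a 32-bit unsigned integer. */
static uint32_t float_bits(float f)
{
    return UNION_CAST(f, uint32_t);
}
```

Under those assumptions `float_bits(1.0f)` yields 0x3F800000. When the macro is applied to a pointer, it converts the pointer value itself; the aliasing rules still apply to whatever is later accessed through the resulting pointer.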

Answer 1

There is no guarantee that variables will be placed in memory in the order of their declaration.

I would use union punning myself.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define SOMETHING   (uint64_t *)0x12345676   // only
#define LITTLEENDIAN 1

typedef union
{
    uint32_t u32[2];
    uint64_t u64;
}data_64;

int main()
{
    uint64_t *Bu64ptr = SOMETHING;

    data_64 mydata[10];

    //you can copy memory
    memcpy(mydata, Bu64ptr, sizeof(mydata));

    //or just loop
    for(size_t index = 0; index < sizeof(mydata) / sizeof(mydata[0]); index++)
    {
        mydata[index].u64 = *Bu64ptr++;
    }

    for(size_t index = 0; index < sizeof(mydata) / sizeof(mydata[0]); index++)
    {   
        printf("Lower word = %x, Upper word = %x\n", mydata[index].u32[!LITTLEENDIAN], mydata[index].u32[LITTLEENDIAN]);
    }    

    return 0;
}

It works exactly the same way in the opposite direction.
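Applied to the 16-word block from the question, the same union idea could look like the sketch below. The type name `block512` and the function `copy_block64` are hypothetical. Both source and destination are unions, so the 64-bit accesses go through a member of the right type rather than through a cast pointer.

```c
#include <assert.h>
#include <stdint.h>

/* The same 64 bytes viewed as 16 words or as 8 double words. */
typedef union {
    uint32_t u32[16];
    uint64_t u64[8];
} block512;

static_assert(sizeof(block512) == 64, "both views must cover 64 bytes");

/* Copy a whole block 64 bits at a time. */
static inline void copy_block64(block512 *dst, const block512 *src)
{
    for (int i = 0; i < 8; i++)
        dst->u64[i] = src->u64[i];
}
```

After the copy, each `dst->u32[i]` equals the corresponding `src->u32[i]` regardless of endianness, because the bytes are moved unchanged.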

Solution 1
2 2018-09-06 22:38:48
