Inner product of two 16bit integer vectors with AVX2 in C++
I am looking for the most efficient way to multiply two aligned int16_t arrays, whose length is divisible by 16, with AVX2.
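For reference, a plain scalar baseline can be useful to check any vectorized version against. This is only a minimal sketch: the name `dot_i16_scalar` is hypothetical, and it assumes the total fits into a 32-bit integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference for the inner product of two int16_t arrays.
// Assumption: the accumulated sum fits into int32_t.
std::int32_t dot_i16_scalar(const std::int16_t* x, const std::int16_t* y, std::size_t n) {
    assert(n % 16 == 0);  // same length constraint as stated in the question
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // widen to 32 bits before multiplying so a single product cannot overflow
        sum += static_cast<std::int32_t>(x[i]) * y[i];
    }
    return sum;
}
```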
After multiplication into a vector x, I started with _mm256_extracti128_si256 and _mm256_castsi256_si128 to get the high and low parts of x, and added them with _mm_add_epi16.
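The step described above might be sketched as follows. The function name `mul_and_fold_i16` is hypothetical, and the use of `_mm256_mullo_epi16` for the multiplication is an assumption, since the post does not say which multiply was used. The `target("avx2")` attribute is GCC/Clang specific; compiling the whole file with `-mavx2` would be the alternative.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper: multiply 16 int16_t pairs, then fold the 256-bit
// product vector into 128 bits by adding its high and low halves.
__attribute__((target("avx2")))
void mul_and_fold_i16(const std::int16_t* x, const std::int16_t* y, std::int16_t out[8]) {
    assert(x != nullptr && y != nullptr && out != nullptr);
    const __m256i xv = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(x));
    const __m256i yv = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(y));
    // Assumption: 16-bit multiply that keeps only the low 16 bits of each product
    const __m256i prod = _mm256_mullo_epi16(xv, yv);
    const __m128i lo = _mm256_castsi256_si128(prod);       // lanes 0..7
    const __m128i hi = _mm256_extracti128_si256(prod, 1);  // lanes 8..15
    const __m128i sum = _mm_add_epi16(lo, hi);             // lane k = x_k + x_(k+8)
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);
}
```

Note that both the products and the sums stay 16 bits wide here, so they wrap around on overflow.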
I copied the result register, applied _mm_move_epi64 to the original register, and added both again with _mm_add_epi16. Now, I think that I have:
-, -, -, -, x15+x7+x11+x3, x14+x6+x10+x2, x13+x5+x9+x1, x12+x4+x8+x0
within the 128bit register. But now I am stuck and don't know how to efficiently sum up the remaining four entries and how to extract the 16bit result.
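One possible way to finish from that state, shown only as a sketch, is to keep folding the low 64 bits with `_mm_shufflelo_epi16` and then read the lowest lane. The function name `hsum_low4_epi16` is hypothetical. Only SSE2 intrinsics are used, and the upper four lanes are treated as "don't care", as in the layout above.

```cpp
#include <emmintrin.h>
#include <cstdint>

// Hypothetical helper: sum the four lowest 16-bit lanes of v.
// The upper four lanes are ignored. The sum wraps around at 16 bits.
std::int16_t hsum_low4_epi16(__m128i v) {
    // bring lanes 2,3 down to positions 0,1 and add: lane0 = v0+v2, lane1 = v1+v3
    __m128i s = _mm_add_epi16(v, _mm_shufflelo_epi16(v, _MM_SHUFFLE(1, 0, 3, 2)));
    // broadcast lane 1 within the low half and add: lane0 = v0+v1+v2+v3
    s = _mm_add_epi16(s, _mm_shufflelo_epi16(s, _MM_SHUFFLE(1, 1, 1, 1)));
    // the low 16 bits of the low 32-bit element are lane 0
    return static_cast<std::int16_t>(_mm_cvtsi128_si32(s));
}
```

For example, a register built with `_mm_setr_epi16(1, 2, 3, 4, 100, 100, 100, 100)` gives 10, since only the four lowest lanes contribute.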
Following the comments and after hours of googling, here is my working solution:
// AVX multiply
hash = 1;
start1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 2000000; i++) {
    ZTYPE* xv = al_entr1.c.data();
    ZTYPE* yv = al_entr2.c.data();
    __m256i tres = _mm256_setzero_si256();
    for (int ii = 0; ii < MAX_SIEVING_DIM; ii += 16) {
        // editor's note: alignment required. Use loadu for unaligned
        __m256i xr = _mm256_load_si256((__m256i*)(xv + ii));
        __m256i yr = _mm256_load_si256((__m256i*)(yv + ii));
        // multiply 16-bit pairs and add adjacent products into eight 32-bit sums
        const __m256i tmp = _mm256_madd_epi16(xr, yr);
        tres = _mm256_add_epi32(tmp, tres);
    }
    // Reduction: 8 x 32 bit -> 4 x 32 bit
    const __m128i x128 = _mm_add_epi32(_mm256_extracti128_si256(tres, 1), _mm256_castsi256_si128(tres));
    // 78 == _MM_SHUFFLE(1, 0, 3, 2): swap the two 64-bit halves
    const __m128i x128_up = _mm_shuffle_epi32(x128, 78);
    const __m128i x64 = _mm_add_epi32(x128, x128_up);
    // add the two remaining partial sums; lane 0 holds the total
    const __m128i _x32 = _mm_hadd_epi32(x64, x64);
    const int res = _mm_extract_epi32(_x32, 0);
    hash |= res;
}
finish1 = std::chrono::high_resolution_clock::now();
elapsed1 = finish1 - start1;
std::cout << "AVX multiply: " << elapsed1.count() << " sec. (" << hash << ")" << std::endl;
It is at least the fastest solution I have so far.
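Pulled out of the benchmark loop, the same idea might look like the sketch below. The function name `dot_i16_avx2` is hypothetical. It uses `_mm256_loadu_si256` so that it also accepts unaligned pointers (switch to `_mm256_load_si256` if 32-byte alignment is guaranteed), and it assumes the accumulated 32-bit sums do not overflow. The `target("avx2")` attribute is GCC/Clang specific; compiling with `-mavx2` would be the alternative.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper around the madd-based inner product.
// Assumptions: n is a multiple of 16 and the 32-bit sums do not overflow.
__attribute__((target("avx2")))
std::int32_t dot_i16_avx2(const std::int16_t* x, const std::int16_t* y, std::size_t n) {
    assert(n % 16 == 0);
    __m256i acc = _mm256_setzero_si256();
    for (std::size_t i = 0; i < n; i += 16) {
        const __m256i xr = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(x + i));
        const __m256i yr = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(y + i));
        // x[2k]*y[2k] + x[2k+1]*y[2k+1] for each of the eight 32-bit lanes
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(xr, yr));
    }
    // 8 lanes -> 4 lanes: add the high and low 128-bit halves
    const __m128i x128 = _mm_add_epi32(_mm256_extracti128_si256(acc, 1),
                                       _mm256_castsi256_si128(acc));
    // 4 lanes -> 2 lanes: add the swapped 64-bit halves
    const __m128i x64 = _mm_add_epi32(x128, _mm_shuffle_epi32(x128, _MM_SHUFFLE(1, 0, 3, 2)));
    // 2 lanes -> 1 lane: horizontal add puts the total into lane 0
    const __m128i x32 = _mm_hadd_epi32(x64, x64);
    return _mm_extract_epi32(x32, 0);
}
```

Because `_mm256_madd_epi16` widens the products to 32 bits before adding, this avoids the 16-bit wraparound of the `_mm_add_epi16` approach from the question.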