Inner product of two 16-bit integer vectors with AVX2 in C++

Question

I am looking for the most efficient way to multiply two aligned int16_t arrays (an inner product) with AVX2. The array length is a multiple of 16.

After multiplying into a vector x, I used _mm256_extracti128_si256 and _mm256_castsi256_si128 to get the high and low halves of x, and added them with _mm_add_epi16.

I copied the result register, applied _mm_move_epi64 to the original register, and added the two again with _mm_add_epi16. Now I think I have:
-, -, -, -, x15+x7+x11+x3, x14+x6+x10+x2, x13+x5+x9+x1, x12+x4+x8+x0 within the 128-bit register. But now I am stuck: I don't know how to efficiently sum the remaining four entries or how to extract the 16-bit result.
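One possible way to finish a 16-bit horizontal sum like the one described above is sketched here. It assumes that wrap-around modulo 2^16 is acceptable, and it sums all eight 16-bit lanes of a __m128i rather than only the low four. The helper name hsum_epi16 is hypothetical and not part of any library; only SSE2 intrinsics are used.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: sums the eight int16_t lanes of v.
// The additions are 16-bit, so the result wraps modulo 2^16.
std::int16_t hsum_epi16(__m128i v) {
    // Add the upper 64 bits onto the lower 64 bits: 8 partial sums -> 4.
    v = _mm_add_epi16(v, _mm_unpackhi_epi64(v, v));
    // Swap the two 32-bit halves of the low 64 bits and add: 4 -> 2.
    v = _mm_add_epi16(v, _mm_shufflelo_epi16(v, _MM_SHUFFLE(1, 0, 3, 2)));
    // Swap neighbouring 16-bit lanes and add: 2 -> 1, total ends up in lane 0.
    v = _mm_add_epi16(v, _mm_shufflelo_epi16(v, _MM_SHUFFLE(2, 3, 0, 1)));
    // _mm_extract_epi16 zero-extends lane 0 into an int; narrow it back.
    return static_cast<std::int16_t>(_mm_extract_epi16(v, 0));
}
```

If the products can exceed the 16-bit range, accumulating in 32 bits is probably the safer choice.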

Answer 1

Following the comments and hours of googling, here is my working solution:

// AVX multiply
hash = 1;
start1 = std::chrono::high_resolution_clock::now();
for(int i=0; i<2000000; i++) {
    ZTYPE*  xv = al_entr1.c.data();
    ZTYPE*  yv = al_entr2.c.data();

    __m256i tres = _mm256_setzero_si256();
    for(int ii=0; ii < MAX_SIEVING_DIM; ii += 16)  // 16 int16_t elements per 256-bit vector
    {
        // editor's note: alignment required.  Use loadu for unaligned
        __m256i  xr = _mm256_load_si256((__m256i*)(xv+ii));
        __m256i  yr = _mm256_load_si256((__m256i*)(yv+ii));
        const __m256i tmp = _mm256_madd_epi16 (xr, yr);
        tres =  _mm256_add_epi32(tmp, tres);
    }

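    // Reduction: tres holds eight 32-bit partial sums
    // 8 -> 4: add the upper 128-bit half onto the lower half
    const __m128i x128 = _mm_add_epi32  ( _mm256_extracti128_si256(tres, 1), _mm256_castsi256_si128(tres));
    // 4 -> 2: swap the two 64-bit halves (_MM_SHUFFLE(1, 0, 3, 2) == 78) and add
    const __m128i x128_up = _mm_shuffle_epi32(x128, _MM_SHUFFLE(1, 0, 3, 2));
    const __m128i x64  = _mm_add_epi32  (x128, x128_up);
    // 2 -> 1: horizontal add of adjacent 32-bit lanes
    const __m128i _x32 =  _mm_hadd_epi32(x64, x64);

    // The total is in the lowest 32-bit lane
    const int res = _mm_extract_epi32(_x32, 0);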
    hash |= res;
}

finish1 = std::chrono::high_resolution_clock::now();
elapsed1 = finish1 - start1;
std::cout << "AVX multiply: " <<elapsed1.count() << " sec. (" << hash << ")" << std::endl;

It is at least the fastest solution so far:

std::inner_product: 0.819781 sec. (-14335)
std::inner_product (aligned): 0.964058 sec. (-14335)
naive multiply: 0.588623 sec. (-14335)
Unroll multiply: 0.505639 sec. (-14335)
AVX multiply: 0.0488352 sec. (-14335)
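For reuse outside the benchmark, the same madd-and-reduce approach could be wrapped in a standalone function, roughly as sketched below. The names dot_i16_avx2 and dot_i16_scalar are hypothetical. The sketch uses unaligned loads so it does not depend on the alignment of the inputs, and it relies on the GCC/Clang target attribute (an assumption about the compiler) so that it builds without -mavx2. The 32-bit accumulator can still overflow for long arrays of large values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Scalar reference: widens each product to 32 bits before summing.
std::int32_t dot_i16_scalar(const std::int16_t* x, const std::int16_t* y, std::size_t n) {
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += static_cast<std::int32_t>(x[i]) * y[i];
    return sum;
}

// AVX2 sketch. Assumes n is a multiple of 16 and the CPU supports AVX2.
// The target attribute is GCC/Clang-specific; with -mavx2 it is not needed.
__attribute__((target("avx2")))
std::int32_t dot_i16_avx2(const std::int16_t* x, const std::int16_t* y, std::size_t n) {
    assert(n % 16 == 0);
    __m256i acc = _mm256_setzero_si256();
    for (std::size_t i = 0; i < n; i += 16) {
        const __m256i xr = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(x + i));
        const __m256i yr = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(y + i));
        // madd multiplies 16-bit lanes and adds adjacent pairs into 32-bit lanes.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(xr, yr));
    }
    // 8 -> 4 partial sums: fold the upper 128 bits onto the lower 128 bits.
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    // 4 -> 2: swap the 64-bit halves and add.
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    // 2 -> 1: swap neighbouring 32-bit lanes and add.
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);  // total is in the lowest lane
}
```

A caller would typically guard the AVX2 path with a runtime check such as __builtin_cpu_supports("avx2") on GCC/Clang and fall back to the scalar version otherwise.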

Answer 1: accepted, 2020-06-04 08:53:42
