
Inner product of two 16-bit integer vectors with AVX2 in C++

I am searching for the most efficient way to compute the inner product of two aligned int16_t arrays with AVX2. The array length is a multiple of 16.

After multiplying into a vector x, I used _mm256_extracti128_si256 and _mm256_castsi256_si128 to get the high and low halves of x and added them with _mm_add_epi16.

I copied the result register, applied _mm_move_epi64 to the original register, and added both again with _mm_add_epi16. Now I think I have:
-, -, -, -, x15+x7+x11+x3, x14+x6+x10+x2, x13+x5+x9+x1, x12+x4+x8+x0 within the 128-bit register. But now I am stuck: I don't know how to efficiently sum the remaining four entries or how to extract the 16-bit result.
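One possible way to finish a 16-bit reduction like the one described above is sketched here. The helper name hsum_epi16 is hypothetical. It sums all eight 16-bit lanes of a 128-bit register, so it could be applied directly after the first _mm_add_epi16 of the two halves. The sum wraps modulo 2^16 like any 16-bit addition. Only SSE2 intrinsics are used.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (also provides _MM_SHUFFLE)

// Hypothetical helper: horizontal sum of the eight 16-bit lanes of v.
// The additions wrap around modulo 2^16.
std::int16_t hsum_epi16(__m128i v) {
    // Swap the two 64-bit halves and add: lane k now holds v[k] + v[k+4].
    v = _mm_add_epi16(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    // Swap adjacent 32-bit elements and add: lanes 0 and 1 each hold four terms.
    v = _mm_add_epi16(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
    // Swap adjacent 16-bit lanes in the low half and add: lane 0 holds the total.
    v = _mm_add_epi16(v, _mm_shufflelo_epi16(v, _MM_SHUFFLE(2, 3, 0, 1)));
    // _mm_extract_epi16 zero-extends lane 0 to int; cast back to signed 16 bits.
    return static_cast<std::int16_t>(_mm_extract_epi16(v, 0));
}
```

Because 16-bit products and sums overflow quickly, this only gives a meaningful result when the total is known to fit in 16 bits.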

Following the comments and after hours of googling, here is my working solution:

// AVX multiply
hash = 1;
start1 = std::chrono::high_resolution_clock::now();
for(int i=0; i<2000000; i++) {
    ZTYPE*  xv = al_entr1.c.data();
    ZTYPE*  yv = al_entr2.c.data();

    __m256i tres = _mm256_setzero_si256();
    for(int ii=0; ii < MAX_SIEVING_DIM; ii += 16)
    {
        // Aligned loads: xv and yv must be 32-byte aligned.
        // Use _mm256_loadu_si256 instead for unaligned data.
        __m256i  xr = _mm256_load_si256((__m256i*)(xv+ii));
        __m256i  yr = _mm256_load_si256((__m256i*)(yv+ii));
        // Multiply 16-bit lanes and add adjacent pairs into eight 32-bit sums.
        const __m256i tmp = _mm256_madd_epi16 (xr, yr);
        tres =  _mm256_add_epi32(tmp, tres);
    }

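    // Reduction: fold the eight 32-bit partial sums in tres down to one.
    // 1) Add the upper and lower 128-bit halves (four sums remain).
    const __m128i x128 = _mm_add_epi32  ( _mm256_extracti128_si256(tres, 1), _mm256_castsi256_si128(tres));
    // 2) Swap the 64-bit halves (78 == _MM_SHUFFLE(1, 0, 3, 2)) and add (two distinct sums remain).
    const __m128i x128_up = _mm_shuffle_epi32(x128, 78);
    const __m128i x64  = _mm_add_epi32  (x128, x128_up);
    // 3) Horizontally add adjacent 32-bit lanes; lane 0 then holds the total.
    const __m128i _x32 =  _mm_hadd_epi32(x64, x64);

    const int res = _mm_extract_epi32(_x32, 0);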
    hash |= res;
}

finish1 = std::chrono::high_resolution_clock::now();
elapsed1 = finish1 - start1;
std::cout << "AVX multiply: " <<elapsed1.count() << " sec. (" << hash << ")" << std::endl;

It is at least the fastest solution so far (the value in parentheses is the printed hash):

  • std::inner_product: 0.819781 sec. (-14335)
  • std::inner_product (aligned): 0.964058 sec. (-14335)
  • naive multiply: 0.588623 sec. (-14335)
  • Unroll multiply: 0.505639 sec. (-14335)
  • AVX multiply: 0.0488352 sec. (-14335)
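The AVX2 loop and reduction above could be packaged as a standalone function, sketched below under a few assumptions. The names dot_i16_scalar, dot_i16_avx2 and dot_i16 are hypothetical. The sketch assumes GCC or Clang on x86-64 (it relies on the target attribute and __builtin_cpu_supports), uses unaligned loads so that the caller need not guarantee 32-byte alignment, and assumes the total fits in a signed 32-bit integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Scalar reference, used for checking results and as a fallback.
std::int32_t dot_i16_scalar(const std::int16_t* x, const std::int16_t* y, std::size_t n) {
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += static_cast<std::int32_t>(x[i]) * y[i];
    return sum;
}

// AVX2 version; n must be a multiple of 16. The target attribute lets this
// function use AVX2 intrinsics even when the file is not built with -mavx2.
__attribute__((target("avx2")))
std::int32_t dot_i16_avx2(const std::int16_t* x, const std::int16_t* y, std::size_t n) {
    assert(n % 16 == 0);
    __m256i acc = _mm256_setzero_si256();
    for (std::size_t i = 0; i < n; i += 16) {
        const __m256i xr = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(x + i));
        const __m256i yr = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(y + i));
        // Multiply 16-bit lanes, add adjacent pairs, accumulate eight 32-bit sums.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(xr, yr));
    }
    // Fold 256 -> 128 bits, then 4 -> 2 -> 1 lanes.
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc), _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}

// Runtime dispatch: use AVX2 only if the CPU reports support for it.
std::int32_t dot_i16(const std::int16_t* x, const std::int16_t* y, std::size_t n) {
    if (__builtin_cpu_supports("avx2"))
        return dot_i16_avx2(x, y, n);
    return dot_i16_scalar(x, y, n);
}
```

The last two reduction steps use shuffles plus adds instead of _mm_hadd_epi32. Both give the same result, so this is a matter of preference rather than a correction.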
