
Find min/max value from a __m128i

I want to find the minimum/maximum value in an array of bytes using SIMD operations. So far I have been able to go through the array and store the minimum/maximum values in a __m128i variable, but that means the value I am looking for is mixed in among others (15 others, to be exact).

I've found these discussions here and here for integers, and this page for floats, but I don't understand how _mm_shuffle* works. So my questions are:

  1. What SIMD operations do I have to perform in order to extract the minimum/maximum byte (or unsigned byte) value from the __m128i variable?
  2. How does _mm_shuffle* work? I don't get it from the "minimal" documentation online. I know it is related to the _MM_SHUFFLE macro, but I don't understand the example.

Here is an example of a horizontal max for uint8_t:

#include <tmmintrin.h> // SSSE3, for _mm_alignr_epi8

__m128i _mm_hmax_epu8(const __m128i v)
{
    __m128i vmax = v;

    // _mm_alignr_epi8(vmax, vmax, n) rotates the vector by n bytes, so after
    // the 1, 2, 4 and 8 byte steps every lane has seen all 16 inputs.
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 1));
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 2));
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 4));
    vmax = _mm_max_epu8(vmax, _mm_alignr_epi8(vmax, vmax, 8));

    return vmax;
}

The max value will be returned in all elements. If you need the value as a scalar, use _mm_extract_epi8 (which requires SSE4.1).
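Since _mm_extract_epi8 is an SSE4.1 intrinsic, a build that otherwise only needs SSSE3 may prefer an older alternative. A minimal sketch, assuming the result has already been broadcast to every byte as in the function above (the helper name low_byte_u8 is made up for illustration):

```c
#include <emmintrin.h> // SSE2
#include <stdint.h>

// Hypothetical helper: returns byte 0 of v.
// Assumes the caller only cares about the lowest byte, e.g. because all
// 16 bytes already hold the same reduced value.
static uint8_t low_byte_u8(__m128i v)
{
    // _mm_cvtsi128_si32 copies the low 32 bits to a scalar register;
    // the cast keeps only the lowest 8 of them.
    return (uint8_t)_mm_cvtsi128_si32(v);
}
```

With SSE4.1 available, _mm_extract_epi8(v, 0) should give the same byte.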

It should be fairly obvious how to adapt this for min, and for signed min/max.
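To make the min adaptation concrete, and to touch on the _mm_shuffle* part of the question, here is a sketch of an unsigned min that needs only SSE2. It is not the code from this answer: it swaps _mm_alignr_epi8 for _mm_shuffle_epi32 plus shifts, and the function name hmin_epu8_sse2 is an assumption of mine. _MM_SHUFFLE(z, y, x, w) just packs four 2-bit lane indices into one immediate, (z << 6) | (y << 4) | (x << 2) | w, and the result's 32-bit lanes 3, 2, 1, 0 are taken from source lanes z, y, x, w.

```c
#include <emmintrin.h> // SSE2; also provides _MM_SHUFFLE via xmmintrin.h
#include <stdint.h>

// Hypothetical name; horizontal minimum of 16 unsigned bytes.
static uint8_t hmin_epu8_sse2(__m128i v)
{
    // _MM_SHUFFLE(1, 0, 3, 2): result lanes 3,2,1,0 = source lanes 1,0,3,2,
    // i.e. the two 64-bit halves are swapped.
    v = _mm_min_epu8(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    // _MM_SHUFFLE(2, 3, 0, 1): adjacent 32-bit lanes are swapped.
    v = _mm_min_epu8(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
    // Every 32-bit lane now holds the same four candidate bytes.
    // Fold bytes 2..3 onto bytes 0..1, then byte 1 onto byte 0.
    v = _mm_min_epu8(v, _mm_srli_epi32(v, 16));
    v = _mm_min_epu8(v, _mm_srli_epi32(v, 8));
    // Only byte 0 is meaningful here; the shifts zeroed the upper bytes.
    return (uint8_t)_mm_cvtsi128_si32(v);
}
```

Replacing _mm_min_epu8 with _mm_max_epu8 should give the unsigned max in the same way. For signed bytes, _mm_min_epi8 and _mm_max_epi8 exist but require SSE4.1.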

Alternatively, convert to words and use phminposuw (not tested):

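#include <smmintrin.h> // SSE4.1, for _mm_minpos_epu16 (phminposuw)

int hminu8(__m128i x)
{
  // Zero-extend the 16 bytes into two vectors of eight 16-bit words.
  __m128i l = _mm_unpacklo_epi8(x, _mm_setzero_si128());
  __m128i h = _mm_unpackhi_epi8(x, _mm_setzero_si128());
  // phminposuw puts each half's minimum in word 0 (and its index in word 1).
  l = _mm_minpos_epu16(l);
  h = _mm_minpos_epu16(h);
  return _mm_extract_epi16(_mm_min_epu16(l, h), 0);
}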

By my quick count, the latency is a bit worse than a min/shuffle cascade, but the throughput is a bit better. The linked answer with phminposuw is probably better, though. Adapted for unsigned bytes (but not tested):

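#include <stdint.h>
#include <smmintrin.h> // SSE4.1

uint8_t hminu8(__m128i x)
{
  // Each 16-bit word becomes min(low byte, high byte), zero-extended.
  x = _mm_min_epu8(x, _mm_srli_epi16(x, 8));
  x = _mm_minpos_epu16(x);
  // The minimum sits in the low byte; the cast drops the index in bits 16-18.
  return (uint8_t)_mm_cvtsi128_si32(x);
}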

You could use it for max too, but with a bit of overhead: complement the input and the result.
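The complement trick can be sketched as follows. This is my reading of the suggestion rather than code from the answer: the name hmaxu8 and the HMAX_TARGET macro are assumptions, and the macro is only there so GCC and Clang accept the SSE4.1 intrinsic without -msse4.1 on the command line.

```c
#include <smmintrin.h> // SSE4.1, for _mm_minpos_epu16
#include <stdint.h>

// Assumption: GCC/Clang-style per-function target selection.
// Other compilers are expected to enable SSE4.1 through their own flags.
#if defined(__GNUC__) || defined(__clang__)
#define HMAX_TARGET __attribute__((target("sse4.1")))
#else
#define HMAX_TARGET
#endif

// Hypothetical name; horizontal maximum of 16 unsigned bytes.
HMAX_TARGET
static uint8_t hmaxu8(__m128i x)
{
    // Complement the input: byte b becomes 255 - b, so the largest byte
    // becomes the smallest.
    x = _mm_xor_si128(x, _mm_set1_epi8(-1));
    // Same steps as the min version: pairwise byte min, then phminposuw.
    x = _mm_min_epu8(x, _mm_srli_epi16(x, 8));
    x = _mm_minpos_epu16(x);
    // Complement the result; the cast keeps only the low byte.
    return (uint8_t)~_mm_cvtsi128_si32(x);
}
```

The overhead compared with the min version is one extra XOR on the vector and one NOT on the scalar.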
