How to convert 16-bit unsigned short to 8-bit unsigned char using scaling efficiently?

I'm trying to convert 16-bit unsigned short data to 8-bit unsigned char using a scaling function. Currently I do this by converting to float, scaling down, and then saturating to 8 bits. Is there a more efficient way to do this?

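#include <stdio.h>
#include <tchar.h>
#include <windows.h>    // USHORT, BYTE
#include <emmintrin.h>  // SSE2 intrinsics

int _tmain(int argc, _TCHAR* argv[])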
{
    float Scale=255.0/65535.0;

    USHORT sArr[8]={512,1024,2048,4096,8192,16384,32768,65535};
    BYTE bArr[8],bArrSSE[8];        

    //Desired Conventional Method
    for (int i = 0; i < 8; i++)
    {
        bArr[i]=(BYTE)(sArr[i]*Scale);                  
    }

    __m128  vf_scale = _mm_set1_ps(Scale),
            vf_Round = _mm_set1_ps(0.5),                      
            vf_zero = _mm_setzero_ps();         
    __m128i vi_zero = _mm_setzero_si128();

    __m128i vi_src = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&sArr[0]));

    __m128 vf_Src_Lo=_mm_cvtepi32_ps(_mm_unpacklo_epi16(vi_src, _mm_set1_epi16(0)));    
    __m128 vf_Src_Hi=_mm_cvtepi32_ps(_mm_unpackhi_epi16(vi_src, _mm_set1_epi16(0)));    

    __m128 vf_Mul_Lo=_mm_sub_ps(_mm_mul_ps(vf_Src_Lo,vf_scale),vf_Round);   
    __m128 vf_Mul_Hi=_mm_sub_ps(_mm_mul_ps(vf_Src_Hi,vf_scale),vf_Round);   

    __m128i v_dst_i = _mm_packus_epi16(_mm_packs_epi32(_mm_cvtps_epi32(vf_Mul_Lo), _mm_cvtps_epi32(vf_Mul_Hi)), vi_zero);
    _mm_storel_epi64((__m128i *)(&bArrSSE[0]), v_dst_i);

    for (int i = 0; i < 8; i++)
    {       
        printf("ushort[%d]= %d     * %f = %.3f ,\tuChar[%d]= %d,\t SSE uChar[%d]= %d \n",i,sArr[i],Scale,(float)(sArr[i]*Scale),i,bArr[i],i,bArrSSE[i]);
    }

    return 0;
}

Please note that the scaling factor may need to be set to other values, e.g. 255.0/512.0, 255.0/1024.0 or 255.0/2048.0, so any solution should not be hard-coded for 255.0/65535.0.

If the ratio in your code is fixed, you can perform the scaling with the following algorithm:

  1. Shift the high byte of each word into the low byte.
    E.g. 0x200 -> 0x2, 0xff80 -> 0xff
  2. Add an offset of -1 if the low byte was less than 0x80.
    E.g. 0x200 -> offset -1, 0xff80 -> offset 0

The first part is easily achieved with _mm_srli_epi16.

The second one is trickier, but it basically consists of taking bit 7 (the highest bit of the low byte) of each word, replicating it across the whole word, and then negating it.
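The replicate-and-negate variant described here could be sketched as follows. This is only an illustration under stated assumptions: the function name scale_hi_byte is hypothetical, n is assumed to be a multiple of 8, and "negating" is taken to mean a bitwise NOT of the replicated bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: high byte of each word, minus 1 when the low byte
   is below 0x80. n must be a multiple of 8. */
static void scale_hi_byte(const uint16_t *src, uint8_t *dst, size_t n)
{
    assert(n % 8 == 0);

    for (size_t i = 0; i < n; i += 8)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)&src[i]);

        /* Move bit 7 up to bit 15, then shift it back arithmetically so it
           fills the word: 0xFFFF if bit 7 was set, 0x0000 otherwise. */
        __m128i bit7 = _mm_srai_epi16(_mm_slli_epi16(v, 8), 15);

        /* Bitwise NOT: -1 where the low byte is < 0x80, 0 elsewhere. */
        __m128i off = _mm_xor_si128(bit7, _mm_set1_epi16(-1));

        /* High byte into low byte, then apply the offset. */
        __m128i r = _mm_add_epi16(_mm_srli_epi16(v, 8), off);

        /* Pack with unsigned saturation, so a result of -1 becomes 0. */
        r = _mm_packus_epi16(r, r);
        _mm_storel_epi64((__m128i *)&dst[i], r);
    }
}
```

For the inputs 0x200 and 0xff80 from the list above, this gives 0x1 and 0xff respectively.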

I used another approach: I created a vector of words with value -1 by comparing a vector with itself for equality.
Then I isolated bit 7 of each source word and added it to the -1 words.

#include <stdio.h>
#include <emmintrin.h>

int main(int argc, char* argv[])
{
    float Scale=255.0/65535.0;

    unsigned short sArr[8]={512,1024,2048,4096,8192,16384,32768,65535};
    unsigned char bArr[8], bArrSSE[16];        

    //Desired Conventional Method
    for (int i = 0; i < 8; i++)
    {
        bArr[i]=(unsigned char)(sArr[i]*Scale);                  
    }



    //Values to be converted
    __m128i vi_src = _mm_loadu_si128((__m128i const*)sArr);

    //This computes 8 words (16-bit) that are
    // -1 if the low byte of the corresponding word in vi_src is less than 0x80
    //  0 if the low byte of the corresponding word in vi_src is >= 0x80

    __m128i vi_off = _mm_cmpeq_epi8(vi_src, vi_src);   //Set all words to -1
    //Add bit 7 of each word in vi_src (moved down to bit 0) to each -1 word
    vi_off = _mm_add_epi16(vi_off, _mm_srli_epi16(_mm_slli_epi16(vi_src, 8), 15));

    //Shift each vi_src word right by 8 (move high byte into low byte)
    vi_src = _mm_srli_epi16(vi_src, 8);
    //Add the offsets
    vi_src = _mm_add_epi16(vi_src, vi_off); 
    //Pack the words into bytes
    vi_src = _mm_packus_epi16(vi_src, vi_src);

    _mm_storeu_si128((__m128i *)bArrSSE, vi_src);

    for (int i = 0; i < 8; i++)
    {       
        printf("%02x %02x\n",   bArr[i],bArrSSE[i]);
    }

    return 0;
}

Here is an implementation and test harness using _mm_mulhi_epu16 to perform a fixed-point scaling operation.

scale_ref is your original scalar code, scale_1 is the floating-point SSE implementation from your (currently deleted) answer, and scale_2 is my fixed-point implementation.

I've factored out the various implementations into separate functions and also added a size parameter and a loop, so that they can be used for arrays of any size (although currently n must be a multiple of 8 for the SSE implementations).
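One common way to lift the multiple-of-8 restriction is to finish with a scalar tail loop. The following is a minimal sketch rather than part of the answer's code: the name scale_any_n is hypothetical, it uses the truncating variant only, and it assumes scale is small enough that every result stays below 0x8000 (where the SSE pack and the scalar clamp agree).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: truncating fixed-point scale for any n. */
static void scale_any_n(const uint16_t *src, uint8_t *dest, uint16_t scale, size_t n)
{
    assert(src != NULL && dest != NULL);

    const __m128i vk_scale = _mm_set1_epi16((short)scale);
    size_t i = 0;

    /* Main loop: 8 elements per iteration. */
    for (; i + 8 <= n; i += 8)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)&src[i]);
        v = _mm_mulhi_epu16(v, vk_scale);   /* (src * scale) >> 16 */
        v = _mm_packus_epi16(v, v);         /* saturate to 8 bits */
        _mm_storel_epi64((__m128i *)&dest[i], v);
    }

    /* Scalar tail: the remaining 0..7 elements, same arithmetic. */
    for (; i < n; i++)
    {
        uint32_t hi = ((uint32_t)src[i] * scale) >> 16;
        dest[i] = (uint8_t)(hi > 255 ? 255 : hi);
    }
}
```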

There is a compile-time flag, ROUND, which controls whether the fixed-point implementation truncates (like your scalar code) or rounds (to nearest). Truncation is slightly faster.

Also note that scale is a run-time parameter, currently hard-coded to 255 (equivalent to 255.0/65535.0) in the test harness below, but it can be any reasonable value.

#include <stdio.h>
#include <stdint.h>
#include <limits.h>
#include <emmintrin.h>  // SSE2: integer intrinsics such as _mm_mulhi_epu16

#define ROUND 1     // use rounding rather than truncation

typedef uint16_t USHORT;
typedef uint8_t BYTE;

static void scale_ref(const USHORT *src, BYTE *dest, const USHORT scale, const size_t n)
{
    const float kScale = (float)scale / (float)USHRT_MAX;

    for (size_t i = 0; i < n; i++)
    {
        dest[i] = src[i] * kScale;
    }
}

static void scale_1(const USHORT *src, BYTE *dest, const USHORT scale, const size_t n)
{
    const float kScale = (float)scale / (float)USHRT_MAX;

    __m128 vf_Scale = _mm_set1_ps(kScale);
    __m128 vf_Round = _mm_set1_ps(0.5f);

    __m128i vi_zero = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 8)
    {
        __m128i vi_src = _mm_loadu_si128((__m128i *)&src[i]);

        __m128 vf_Src_Lo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(vi_src, _mm_set1_epi16(0)));
        __m128 vf_Src_Hi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(vi_src, _mm_set1_epi16(0)));
        __m128 vf_Mul_Lo = _mm_mul_ps(vf_Src_Lo, vf_Scale);
        __m128 vf_Mul_Hi = _mm_mul_ps(vf_Src_Hi, vf_Scale);

        //Convert -ive to +ive Value
        vf_Mul_Lo = _mm_max_ps(_mm_sub_ps(vf_Round, vf_Mul_Lo), vf_Mul_Lo);
        vf_Mul_Hi = _mm_max_ps(_mm_sub_ps(vf_Round, vf_Mul_Hi), vf_Mul_Hi);

        __m128i v_dst_i = _mm_packus_epi16(_mm_packs_epi32(_mm_cvtps_epi32(vf_Mul_Lo), _mm_cvtps_epi32(vf_Mul_Hi)), vi_zero);
        _mm_storel_epi64((__m128i *)&dest[i], v_dst_i);
    }
}

static void scale_2(const USHORT *src, BYTE *dest, const USHORT scale, const size_t n)
{
    const __m128i vk_scale = _mm_set1_epi16(scale);
#if ROUND
    const __m128i vk_round = _mm_set1_epi16(scale / 2);
#endif

    for (size_t i = 0; i < n; i += 8)
    {
        __m128i v = _mm_loadu_si128((__m128i *)&src[i]);
#if ROUND
        v = _mm_adds_epu16(v, vk_round);
#endif
        v = _mm_mulhi_epu16(v, vk_scale);
        v = _mm_packus_epi16(v, v);
        _mm_storel_epi64((__m128i *)&dest[i], v);
    }
}

int main(int argc, char* argv[])
{
    const size_t n = 8;
    const USHORT scale = 255;

    USHORT src[n] = { 512, 1024, 2048, 4096, 8192, 16384, 32768, 65535 };
    BYTE dest_ref[n], dest_1[n], dest_2[n];

    scale_ref(src, dest_ref, scale, n);
    scale_1(src, dest_1, scale, n);
    scale_2(src, dest_2, scale, n);

    for (size_t i = 0; i < n; i++)
    {
        printf("src = %u, ref = %u, test_1 = %u, test_2 = %u\n", src[i], dest_ref[i], dest_1[i], dest_2[i]);
    }

    return 0;
}
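To see how a single integer multiplier can stand in for a ratio such as 255.0/512.0, here is a scalar model of one lane of the fixed-point approach. It is only a sketch: ratio_to_multiplier and scale_fixed are hypothetical names, the ratio is assumed to lie in [0, 1), and the clamp saturates as an unsigned value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a ratio in [0, 1) to a 16-bit multiplier,
   i.e. the ratio expressed in units of 1/65536. */
static uint16_t ratio_to_multiplier(double ratio)
{
    assert(ratio >= 0.0 && ratio < 1.0);
    return (uint16_t)(ratio * 65536.0);
}

/* Scalar model of one lane: take the high 16 bits of the 32-bit product
   (what _mm_mulhi_epu16 returns), then clamp to 8 bits. */
static uint8_t scale_fixed(uint16_t x, uint16_t mult)
{
    uint32_t hi = ((uint32_t)x * mult) >> 16;
    return (uint8_t)(hi > 255 ? 255 : hi);
}
```

For example, ratio_to_multiplier(255.0/512.0) is 32640, and scale_fixed(512, 32640) is 255. With a multiplier of 255, an input of 65535 gives 254 rather than 255, because taking the high half of the product divides by 65536 instead of 65535. Note also that _mm_packus_epi16 treats its inputs as signed 16-bit values, so the SSE version only matches this model while the high halves stay below 0x8000.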

OK, I found the solution with reference to this.

Here is my solution:

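#include <stdio.h>
#include <tchar.h>
#include <windows.h>    // USHORT, BYTE
#include <emmintrin.h>  // SSE2 intrinsics

int _tmain(int argc, _TCHAR* argv[])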
{
    float Scale=255.0/65535.0;

    USHORT sArr[8]={512,1024,2048,4096,8192,16384,32768,65535};
    BYTE bArr[8],bArrSSE[8];        

    //Desired Conventional Method
    for (int i = 0; i < 8; i++)
    {
        bArr[i]=(BYTE)(sArr[i]*Scale);                  
    }

    __m128  vf_scale = _mm_set1_ps(Scale),                      
            vf_zero = _mm_setzero_ps();         
    __m128i vi_zero = _mm_setzero_si128();

    __m128i vi_src = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&sArr[0]));

    __m128 vf_Src_Lo=_mm_cvtepi32_ps(_mm_unpacklo_epi16(vi_src, _mm_set1_epi16(0)));    
    __m128 vf_Src_Hi=_mm_cvtepi32_ps(_mm_unpackhi_epi16(vi_src, _mm_set1_epi16(0)));    
    __m128 vf_Mul_Lo=_mm_mul_ps(vf_Src_Lo,vf_scale);    
    __m128 vf_Mul_Hi=_mm_mul_ps(vf_Src_Hi,vf_scale);

    //Take the absolute value: max(-x, x)
    vf_Mul_Lo=_mm_max_ps(_mm_sub_ps(vf_zero, vf_Mul_Lo), vf_Mul_Lo);
    vf_Mul_Hi=_mm_max_ps(_mm_sub_ps(vf_zero, vf_Mul_Hi), vf_Mul_Hi);

    __m128i v_dst_i = _mm_packus_epi16(_mm_packs_epi32(_mm_cvtps_epi32(vf_Mul_Lo), _mm_cvtps_epi32(vf_Mul_Hi)), vi_zero);
    _mm_storel_epi64((__m128i *)(&bArrSSE[0]), v_dst_i);

    for (int i = 0; i < 8; i++)
    {       
        printf("ushort[%d]= %d     * %f = %.3f ,\tuChar[%d]= %d,\t SSE uChar[%d]= %d \n",i,sArr[i],Scale,(float)(sArr[i]*Scale),i,bArr[i],i,bArrSSE[i]);
    }

    return 0;
}
