Converting an RGBA image to an RGB image

I am trying to convert an RGBA image to an RGB image (8-bit unsigned integer per channel). At first I used OpenCV, with the following code:

m_bufferMat.data = (uchar*) (ptr1);
m_bufferMat.convertTo(m_bufferMat, CV_8UC3);

The other parts of the application don't need OpenCV, so I tried to convert the image myself to avoid including and linking against the OpenCV library. The fastest method I could think of is to iterate through the buffer and copy only the first 3 bytes of each pixel to another buffer, like this:

for(int i = 0; i < width * height; i++) {
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    ptr1++;
}

But this requires a copy, which may not be very fast. The OpenCV function is 1.5 times faster than my own function. Does anybody have an idea why? Can I implement a function that doesn't need to copy?

There are many optimizations that could be done. Here is a test bench program to try them out, along with a few example optimizations:

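#include <cstdio>      // printf
#include <cstdlib>     // rand
#include <cstring>     // memset, memcpy, memcmp
#include <functional>  // std::function
#include <intrin.h>    // __rdtsc (MSVC; GCC and Clang declare it in <x86intrin.h>)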

volatile int width = 1920;
volatile int height = 1080;

unsigned char* src = new unsigned char[width * height * 4];
unsigned char* dst = new unsigned char[width * height * 3];
unsigned char* refDst = new unsigned char[width * height * 3];


void DefaultFunc() {
  auto ptr1 = src;
  auto ptr2 = dst;
  for (int i = 0; i < width * height; i++) {
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    ptr1++;
  }
}

void NPreCalculatedFunc() {
  auto ptr1 = src;
  auto ptr2 = dst;
  auto n = width * height;
  for (int i = 0; i < n; i++) {
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    *(ptr2++) = *(ptr1++);
    ptr1++;
  }
}

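// Reads each pixel as one 32-bit word. Assumes a little-endian CPU,
// so the lowest byte of the word is the first channel in memory.
void ReadFullPixelFunc() {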
  unsigned int* ptr1 = (unsigned int*)src;
  auto ptr2 = dst;
  auto n = width * height;
  for (int i = 0; i < n; i++) {
    auto srcPix = *(ptr1++);
    *(ptr2++) = srcPix & 0xff;
    *(ptr2++) = (srcPix >> 8) & 0xff;
    *(ptr2++) = (srcPix >> 16) & 0xff;
  }
}
  

void ReadAndWriteFullPixelFunc() {
  unsigned int* ptr1 = (unsigned int*)src;
  unsigned int* ptr2 = (unsigned int*)dst;
  auto n = width * height / 4; 
  unsigned int writeBuf = 0;
  for (int i = n; i; i--) {   
    // by reading 4 pixels, we get to store 3 unsigned ints
    auto srcPix = *(ptr1++);    
    writeBuf = srcPix & 0x00ffffff;
    srcPix = *(ptr1++);
    writeBuf |= srcPix << 24;
    *(ptr2++) = writeBuf;
    
    writeBuf = (srcPix >> 8) & 0xffff;
    srcPix = *(ptr1++);
    writeBuf |= (srcPix << 16);
    *(ptr2++) = writeBuf;

    writeBuf = (srcPix >> 16) & 0xff;
    srcPix = *(ptr1++);
    writeBuf |= (srcPix << 8);
    *(ptr2++) = writeBuf;
  }
  // todo: if width * height is not divisible by 4, process the last max 3 pixels here with the unoptimized loop
}



unsigned long long PrintShortestTime(std::function<void()> f, const char *label, unsigned long long refTime) {
  unsigned long long minTicks = ~0ull;
  memset(dst, 0, width * height * 3);
  for (int i = 0; i < 500; i++) {
    auto start = __rdtsc();
    f();
    auto end = __rdtsc();
    auto duration = end - start;
    if (duration < minTicks) {
      minTicks = duration;
    }
  }
  if (memcmp(refDst, dst, width * height * 3)) {
    printf("Fail - result does not equal reference!\n");
  }
  printf("%s : %llu clock cycles - %0.3lf x base implementation time\n", label, minTicks, refTime ? ((double)minTicks/(double)refTime):1.0);
  return minTicks;
}

int main() {
  for (int i = 0; i < width * height * 4; i++) {
    src[i] = rand() & 0xff;
  }
  DefaultFunc();
  memcpy(refDst, dst, width * height * 3);


  auto refTime = PrintShortestTime(DefaultFunc, "default, unoptimized", 0);  

  PrintShortestTime(NPreCalculatedFunc, "n precalculated", refTime);

  PrintShortestTime(ReadFullPixelFunc, "n precalculated, reading 1 pixel at a time", refTime);
    
  PrintShortestTime(ReadAndWriteFullPixelFunc, "n precalculated, reading and writing ints at a time", refTime);
}
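
The todo in ReadAndWriteFullPixelFunc leaves out images whose pixel count is not divisible by 4. A minimal sketch of one way to finish it follows. The name RgbaToRgbPacked and its parameters are hypothetical and not part of the test bench above. It assumes a little-endian CPU, and it uses memcpy for the 32-bit loads and stores so that it does not depend on the buffers being aligned for unsigned int.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original test bench).
// Converts pixelCount RGBA pixels in src to packed RGB in dst.
// Assumption: little-endian byte order, as on x86/x64.
void RgbaToRgbPacked(const unsigned char* src, unsigned char* dst,
                     std::size_t pixelCount) {
  assert(src != nullptr && dst != nullptr);

  // Main loop: 4 RGBA pixels (16 bytes) in, 3 words (12 bytes) out.
  const std::size_t blocks = pixelCount / 4;
  for (std::size_t b = 0; b < blocks; ++b) {
    std::uint32_t p[4];
    std::memcpy(p, src, sizeof p);  // alignment-safe load
    std::uint32_t out[3];
    out[0] = (p[0] & 0x00ffffffu) | (p[1] << 24);     // R0 G0 B0 R1
    out[1] = ((p[1] >> 8) & 0xffffu) | (p[2] << 16);  // G1 B1 R2 G2
    out[2] = ((p[2] >> 16) & 0xffu) | (p[3] << 8);    // B2 R3 G3 B3
    std::memcpy(dst, out, sizeof out);  // alignment-safe store
    src += 16;
    dst += 12;
  }

  // Tail: the remaining 0 to 3 pixels, copied byte by byte.
  for (std::size_t i = 0; i < pixelCount % 4; ++i) {
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    src += 4;
    dst += 3;
  }
}
```

Compilers usually turn the fixed-size memcpy calls into plain loads and stores, so this form should cost little compared with the pointer casts, but that is worth measuring in the test bench rather than assuming.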

For me, with Visual Studio targeting x64, the last version in the test bench (ReadAndWriteFullPixelFunc) is about 2x as fast as the naive version:

default, unoptimized : 7484728 clock cycles - 1.000 x base implementation time
n precalculated : 7269432 clock cycles - 0.971 x base implementation time
n precalculated, reading 1 pixel at a time : 7258632 clock cycles - 0.970 x base implementation time
n precalculated, reading and writing ints at a time : 3484200 clock cycles - 0.466 x base implementation time
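
The numbers above come from __rdtsc, which ties the test bench to x86 and to a compiler that provides that intrinsic. If that is a problem, the same "shortest of N runs" measurement can be sketched with std::chrono. The function name ShortestTimeNs is hypothetical, and it reports nanoseconds rather than clock cycles, so its results are not directly comparable with the cycle counts listed here.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <limits>

// Hypothetical helper (not from the original test bench).
// Runs f `runs` times and returns the shortest wall-clock duration
// of a single run, in nanoseconds.
std::int64_t ShortestTimeNs(const std::function<void()>& f, int runs) {
  assert(runs > 0);
  std::int64_t best = std::numeric_limits<std::int64_t>::max();
  for (int i = 0; i < runs; ++i) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto end = std::chrono::steady_clock::now();
    const std::int64_t ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    if (ns < best) {
      best = ns;  // keep the minimum, as PrintShortestTime does
    }
  }
  return best;
}
```

A call such as ShortestTimeNs(DefaultFunc, 500) would mirror the 500 iterations used in PrintShortestTime.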

Faster implementations would require SIMD instructions.
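
As a rough illustration of the SIMD direction, here is a sketch that uses the SSSE3 byte shuffle (_mm_shuffle_epi8) to pack 4 RGBA pixels into 12 RGB bytes per iteration. It was not part of the measurements above, so no speedup is claimed for it. The function name RgbaToRgbSimd is hypothetical, the code assumes the CPU supports SSSE3 at run time, and on non-x86 targets it falls back to the plain byte loop.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8
#define RGBA2RGB_HAVE_SSSE3 1
#endif

// Hypothetical helper (not from the original test bench).
// Assumption: when compiled for x86, the CPU supports SSSE3.
#if defined(RGBA2RGB_HAVE_SSSE3) && (defined(__GNUC__) || defined(__clang__))
__attribute__((target("ssse3")))  // lets GCC/Clang build this without -mssse3
#endif
void RgbaToRgbSimd(const unsigned char* src, unsigned char* dst,
                   std::size_t pixelCount) {
  assert(src != nullptr && dst != nullptr);
  std::size_t i = 0;
#ifdef RGBA2RGB_HAVE_SSSE3
  // Select R, G, B of four pixels into the low 12 bytes.
  // An index of -1 has its high bit set, which zeroes that output byte.
  const __m128i mask = _mm_setr_epi8(0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14,
                                     -1, -1, -1, -1);
  // Each store writes 16 bytes, of which only 12 are meaningful. Requiring
  // at least 6 remaining pixels (18 bytes of dst) keeps the 4 extra bytes
  // inside the buffer; later iterations or the tail overwrite them.
  for (; i + 6 <= pixelCount; i += 4) {
    const __m128i px =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 4 * i));
    const __m128i packed = _mm_shuffle_epi8(px, mask);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 3 * i), packed);
  }
#endif
  // Scalar tail (or the whole image when SSSE3 is not available).
  for (; i < pixelCount; ++i) {
    dst[3 * i + 0] = src[4 * i + 0];
    dst[3 * i + 1] = src[4 * i + 1];
    dst[3 * i + 2] = src[4 * i + 2];
  }
}
```

The unaligned load and store intrinsics are used because nothing in the question guarantees 16-byte alignment of either buffer. Wider variants (for example with AVX2) follow the same idea but need extra care, since byte shuffles there work within 128-bit lanes.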
