Fastest way to access each pixel of an image?

Question

I am trying to find the fastest way to access the pixels in an image. I have tried two options:

#include <opencv2/opencv.hpp>
#include <cmath>
#include <ctime>
#include <iostream>
using namespace cv;
using namespace std;

// Define a pixel 
typedef Point3_<uint8_t> Pixel;

void complicatedThreshold(Pixel& pixel);

int main()
{
    cv::Mat frame = imread("img.jpg");

    clock_t t1, t2;
    t1 = clock();

    for (int i = 0; i < 10; i++)
    {
        //===================
        // Option 1: Using pointer arithmetic 
        //===================
        Pixel* pixel = frame.ptr<Pixel>(0, 0);
        const Pixel* endPixel = pixel + frame.cols * frame.rows;
        for (; pixel != endPixel; pixel++)
        {
            complicatedThreshold(*pixel);
        }

        //===================
        // Option 2: Call forEach
        //===================
        frame.forEach<Pixel>
            (
                [](Pixel& pixel, const int* position) -> void
                {
                    complicatedThreshold(pixel);
                }
        );
    }

    t2 = clock();
    float t_diff((float)t2 - (float)t1);
    float seconds = t_diff / CLOCKS_PER_SEC;
    float mins = seconds / 60.0;
    float hrs = mins / 60.0;

    cout << "Execution Time (mins): " << mins << "\n";

    waitKey(1);
}

void complicatedThreshold(Pixel& pixel)
{
    if (pow(double(pixel.x) / 10, 2.5) > 100)
    {
        pixel.x = 255;
        pixel.y = 255;
        pixel.z = 255;
    }
    else
    {
        pixel.x = 0;
        pixel.y = 0;
        pixel.z = 0;
    }
}

Option 1 is much slower than option 2 (0.0034 vs. 0.001), which is what I expected according to this page.

Is there a more efficient way to access the pixels of an image?

Answer 1

This isn't really about pixel access. It's more about how much computation you do per pixel, whether that computation can be vectorized, whether it can be parallelized (as you have done in your second attempt), and much more (but fortunately we can ignore those details here).

Let's first focus on a scenario where we use no explicit parallelization (i.e. no forEach for now).

Let's start with your original threshold function, make it a little terser, and mark it as inline (which helps marginally):

inline void complicatedThreshold(Pixel& pixel)
{
    if (std::pow(double(pixel.x) / 10, 2.5) > 100) {
        pixel = { 255, 255, 255 };
    } else {
        pixel = { 0, 0, 0 };
    }
}

and drive it in the following manner:

void impl_1(cv::Mat frame)
{
    auto pixel = frame.ptr<Pixel>();
    auto const endPixel = pixel + frame.total();
    for (; pixel != endPixel; ++pixel) {
        complicatedThreshold(*pixel);
    }
}

We will test this (and the improved versions) on a randomly generated 3-channel image of size 8192x8192.

The baseline completes in 3139 ms.
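One caveat about impl_1: walking the whole image with a single pointer assumes the rows are stored back to back. In OpenCV that is what cv::Mat::isContinuous() reports; it holds for a freshly loaded or cloned image, but generally not for a region-of-interest view. A minimal sketch of the row-by-row alternative, using plain C++ so it compiles without OpenCV. StridedImage and for_each_pixel are hypothetical names standing in for cv::Mat and a per-row frame.ptr<Pixel>(r) loop; they are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 3-channel 8-bit image whose rows may be padded.
// `step` is the number of bytes from the start of one row to the next,
// so step >= cols * 3 (equality means the rows are stored back to back).
struct StridedImage {
    int rows;
    int cols;
    std::size_t step;
    std::vector<std::uint8_t> data;

    StridedImage(int r, int c, std::size_t s)
        : rows(r), cols(c), step(s), data(static_cast<std::size_t>(r) * s, 0)
    {
        assert(s >= static_cast<std::size_t>(c) * 3);
    }

    // Pointer to the first byte of row r.
    std::uint8_t* row_ptr(int r)
    {
        return data.data() + static_cast<std::size_t>(r) * step;
    }
};

// Call fn(px) for every pixel, where px points at 3 interleaved bytes.
// The padding bytes at the end of each row are never touched.
template <typename Fn>
void for_each_pixel(StridedImage& img, Fn fn)
{
    for (int r = 0; r < img.rows; ++r) {
        std::uint8_t* px = img.row_ptr(r);
        std::uint8_t* const row_end = px + static_cast<std::size_t>(img.cols) * 3;
        for (; px != row_end; px += 3) {
            fn(px);
        }
    }
}
```

The inner loop is the same pointer walk as in impl_1; only the outer loop re-derives the start of each row instead of assuming one contiguous run.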

Using impl_1 as a baseline, we will check all the improvements for correctness using the following template function:

template <typename T>
void require_same_result(cv::Mat frame, T const& fn1, T const& fn2)
{
    cv::Mat working_frame_1(frame.clone());
    fn1(working_frame_1);

    cv::Mat working_frame_2(frame.clone());
    fn2(working_frame_2);


    if (cv::sum(working_frame_1 != working_frame_2) != cv::Scalar(0, 0, 0, 0)) {
        throw std::runtime_error("Mismatch.");
    }
}
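The measurement code itself is not shown in this answer. As a rough sketch of how such numbers could be gathered (an assumption, not the harness actually used), the helper below runs a callable several times and keeps the fastest wall-clock time. It uses std::chrono::steady_clock rather than clock() as in the question, because clock() reports processor time, which can advance faster than wall time when several threads are busy. min_time_ms is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: run `fn` `runs` times and return the fastest
// wall-clock duration in milliseconds.
template <typename Fn>
double min_time_ms(Fn fn, int runs)
{
    assert(runs > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        auto const t0 = std::chrono::steady_clock::now();
        fn();
        auto const t1 = std::chrono::steady_clock::now();
        double const ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        best = std::min(best, ms);
    }
    return best;
}
```

A call might look like min_time_ms([&] { impl_1(frame); }, 10). Keep in mind that the impl_ functions modify the image in place, so later runs see already-thresholded data.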

Improvement 1

We can try to take advantage of optimized functions that OpenCV provides.

Let's recall that for each pixel we perform a threshold operation on the following condition:

std::pow(double(pixel.x) / 10, 2.5) > 100

First of all, we only need the first channel for our calculations. Let's extract it using cv::extractChannel.

Next, we need to convert the first channel to double type. To do this, we can use cv::Mat::convertTo. This function provides another advantage: it allows us to specify a scaling factor. We can provide an alpha factor of 0.1 to take care of the division by 10 in the same call.

As the next step, we use cv::pow to perform the exponentiation in an efficient manner on the whole array. We compare the result with the threshold value of 100. The comparison operator that OpenCV provides will return 255 for true and 0 for false. Given that, we just have to merge 3 identical copies of the resulting array and we're done.

void impl_2(cv::Mat frame)
{
    cv::Mat1b first_channel;
    cv::extractChannel(frame, first_channel, 0);

    cv::Mat1d tmp;
    first_channel.convertTo(tmp, CV_64FC1, 0.1);
    cv::pow(tmp, 2.5, tmp);

    first_channel = tmp > 100;

    cv::merge(std::vector<cv::Mat>{ first_channel, first_channel, first_channel }, frame);
}

This implementation completes in 842 ms.

Improvement 2

This calculation doesn't really require double precision... let's perform it with floats only.

void impl_3(cv::Mat frame)
{
    cv::Mat1b first_channel;
    cv::extractChannel(frame, first_channel, 0);

    cv::Mat1f tmp;
    first_channel.convertTo(tmp, CV_32FC1, 0.1);
    cv::pow(tmp, 2.5, tmp);

    first_channel = tmp > 100;

    cv::merge(std::vector<cv::Mat>{ first_channel, first_channel, first_channel }, frame);
}

This implementation completes in 516 ms.
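Whether single precision is safe here can be checked directly, since there are only 256 possible inputs. The sketch below compares the condition in double and in float for every 8-bit value using the standard library. It mimics the arithmetic of impl_2 and impl_3 but does not call OpenCV, so treat it as a plausibility check rather than a proof about convertTo and cv::pow. count_precision_mismatches is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Count the 8-bit inputs for which the float and double versions of the
// condition pow(x / 10, 2.5) > 100 give different answers.
int count_precision_mismatches()
{
    int mismatches = 0;
    for (int i = 0; i < 256; ++i) {
        bool const as_double = std::pow(static_cast<double>(i) / 10, 2.5) > 100;
        bool const as_float = std::pow(static_cast<float>(i) * 0.1f, 2.5f) > 100.0f;
        if (as_double != as_float) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

The inputs closest to the threshold are 63 and 64, which give roughly 99.6 and 103.6, far enough from 100 that float rounding cannot flip the comparison.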

Improvement 3

OK, but hold on. For each pixel we have to divide by 10 (or multiply by 0.1), then calculate the 2.5th exponent (that's gonna be expensive)... but there are only 256 possible input values for an image that could have millions of pixels. What if we pre-calculated a lookup table and used that instead of per-pixel calculations?

cv::Mat make_lut()
{
    cv::Mat1b result(256, 1);
    for (uint32_t i(0); i < 256; ++i) {
        if (pow(double(i) / 10, 2.5) > 100) {
            result.at<uchar>(i, 0) = 255;
        } else {
            result.at<uchar>(i, 0) = 0;
        }
    }
    return result;
}

void impl_4(cv::Mat frame)
{
    cv::Mat lut(make_lut());

    cv::Mat first_channel;
    cv::extractChannel(frame, first_channel, 0);

    cv::LUT(first_channel, lut, first_channel);

    cv::merge(std::vector<cv::Mat>{ first_channel, first_channel, first_channel }, frame);
}

This implementation completes in 68 ms.
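To make the idea concrete without OpenCV, here is a minimal sketch of the same lookup-table technique on a plain interleaved byte buffer. make_table and apply_table are hypothetical names; this illustrates the principle and says nothing about how cv::LUT is implemented internally.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build the 256-entry table once: table[v] is the output for input value v.
std::array<std::uint8_t, 256> make_table()
{
    std::array<std::uint8_t, 256> table{};
    for (int v = 0; v < 256; ++v) {
        table[v] = std::pow(static_cast<double>(v) / 10, 2.5) > 100 ? 255 : 0;
    }
    return table;
}

// `pixels` holds interleaved 3-channel data (c0 c1 c2 c0 c1 c2 ...).
// Each pixel's first channel is looked up and the result is written
// to all three channels of that pixel.
void apply_table(std::vector<std::uint8_t>& pixels,
                 std::array<std::uint8_t, 256> const& table)
{
    assert(pixels.size() % 3 == 0);
    for (std::size_t i = 0; i < pixels.size(); i += 3) {
        std::uint8_t const out = table[pixels[i]];
        pixels[i] = out;
        pixels[i + 1] = out;
        pixels[i + 2] = out;
    }
}
```

The expensive std::pow call now runs 256 times in total instead of once per pixel; the per-pixel work is a single array read.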

Improvement 4

However, we don't really need a look-up table. We can do some math to simplify that "complicated" threshold function:

$\left(\frac{x}{10}\right)^{2.5} > 100$

Let's raise both sides to the power 1/2.5 to eliminate the exponentiation on the left-hand side.

$\frac{x}{10} > \sqrt[2.5]{100}$

And let us simplify the right-hand side (it's a constant).

$\frac{x}{10} > 6.30957$

Finally, let's multiply by 10 to eliminate the fraction on the left-hand side.

$x > 63.0957$

And since we are only dealing with integers, we can use

x > 63
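Because a channel value can only take 256 distinct values, the simplification can be verified exhaustively rather than taken on trust. A small sketch in standard C++; simplified_threshold_matches is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Returns true if `x > 63` agrees with the original condition
// for every possible 8-bit channel value.
bool simplified_threshold_matches()
{
    for (int x = 0; x < 256; ++x) {
        bool const original = std::pow(static_cast<double>(x) / 10, 2.5) > 100;
        bool const simplified = x > 63;
        if (original != simplified) {
            return false;
        }
    }
    return true;
}
```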

OK, let's try this with the first variant.

inline void complicatedThreshold_2(Pixel& pixel)
{
    if (pixel.x > 63) {
        pixel = { 255, 255, 255 };
    } else {
        pixel = { 0, 0, 0 };
    }
}

void impl_5(cv::Mat frame)
{
    auto pixel = frame.ptr<Pixel>();
    auto const endPixel = pixel + frame.total();
    for (; pixel != endPixel; pixel++) {
        complicatedThreshold_2(*pixel);
    }
}

This implementation completes in 166 ms.

Note: As bad as this may seem compared to the previous step, it is almost a 20x improvement over the comparable baseline (impl_1, 3139 ms).

Improvement 5

That really looks like a threshold operation on the first channel, that's replicated onto the remaining 2 channels.

void impl_6(cv::Mat frame)
{
    cv::Mat first_channel;
    cv::extractChannel(frame, first_channel, 0);

    cv::threshold(first_channel, first_channel, 63, 255, cv::THRESH_BINARY);

    cv::merge(std::vector<cv::Mat>{ first_channel, first_channel, first_channel }, frame);
}

This implementation completes in 65 ms.

Time to try parallelizing this. Let's start with forEach.

Parallel implementation of the baseline algorithm:

void impl_7(cv::Mat frame)
{
    frame.forEach<Pixel>(
        [](Pixel& pixel, const int* position)
        {
            complicatedThreshold(pixel);
        }
    );
}

This implementation completes in 350 ms.

Parallel implementation of the simplified algorithm:

void impl_8(cv::Mat frame)
{
    frame.forEach<Pixel>(
        [](Pixel& pixel, const int* position)
        {
            complicatedThreshold_2(pixel);
        }
    );
}

This implementation completes in 20 ms.

That's pretty good: we're at around a 157x improvement compared to the original naive algorithm. It even beats the best non-parallelized attempt by a factor of about 3. Can we do better?

Further Improvements

One more easy option is to try parallel_for_.

typedef void(*impl_fn)(cv::Mat);

void impl_parallel(cv::Mat frame, impl_fn const& fn)
{
    cv::parallel_for_(cv::Range(0, frame.rows), [&](const cv::Range& range) {
        for (int r = range.start; r < range.end; r++) {
            fn(frame.row(r));
        }
    });
}


void impl_9(cv::Mat frame)
{
    impl_parallel(frame, impl_1);
}

void impl_10(cv::Mat frame)
{
    impl_parallel(frame, impl_2);
}

void impl_11(cv::Mat frame)
{
    impl_parallel(frame, impl_3);
}

void impl_12(cv::Mat frame)
{
    impl_parallel(frame, impl_4);
}

void impl_13(cv::Mat frame)
{
    impl_parallel(frame, impl_5);
}

void impl_14(cv::Mat frame)
{
    impl_parallel(frame, impl_6);
}

The timings are:

Test 9 minimum: 355 ms.
Test 10 minimum: 108 ms.
Test 11 minimum: 62 ms.
Test 12 minimum: 25 ms.
Test 13 minimum: 19 ms.
Test 14 minimum: 11 ms.

So, there you go: a 285x improvement on a 6-core CPU with Hyper-Threading enabled.
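For readers who want to see the row-splitting idea behind impl_parallel without OpenCV, here is a rough sketch using std::thread on a plain interleaved buffer. It is not how cv::parallel_for_ is implemented (OpenCV dispatches to whichever threading backend it was built with); threshold_row and threshold_parallel are hypothetical names, and the fixed band size is a simplifying assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Threshold one row of interleaved 3-channel pixels in place
// (same rule as complicatedThreshold_2: first channel > 63).
void threshold_row(std::uint8_t* row, int cols)
{
    for (int c = 0; c < cols; ++c) {
        std::uint8_t const out = row[3 * c] > 63 ? 255 : 0;
        row[3 * c] = row[3 * c + 1] = row[3 * c + 2] = out;
    }
}

// Split [0, rows) into at most `workers` contiguous bands and process each
// band on its own thread. Bands do not overlap, so no locking is needed.
void threshold_parallel(std::vector<std::uint8_t>& pixels, int rows, int cols, int workers)
{
    assert(workers > 0);
    assert(pixels.size() == static_cast<std::size_t>(rows) * cols * 3);
    int const band = (rows + workers - 1) / workers; // rows per band, rounded up
    std::vector<std::thread> pool;
    for (int start = 0; start < rows; start += band) {
        int const end = std::min(start + band, rows);
        pool.emplace_back([&pixels, cols, start, end] {
            for (int r = start; r < end; ++r) {
                threshold_row(pixels.data() + static_cast<std::size_t>(r) * cols * 3, cols);
            }
        });
    }
    for (auto& t : pool) {
        t.join();
    }
}
```

Depending on the toolchain, building this may require a threading flag such as -pthread. Spawning threads per call has a cost of its own, which is one reason to prefer a pooled facility like parallel_for_ in real code.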

Answer 2

OpenCV provides a high-level parallel graphics library that takes advantage of special CPU and GPU instruction sets and can also use the OpenCL parallel platform. Its algorithms are optimized well enough to rank among the fastest available.
On the other hand, every high-level library gives up a little performance in exchange for generality and simplicity. You are almost always able to write faster code for a specific, limited problem by using native, low-level instructions and APIs, but that usually requires much more knowledge of parallel programming and much more development time. The resulting source code will also be more complicated.
