How to optimize a 3-by-3 matrix multiplication with a point using SSE?

Question

I have to apply a transformation matrix to each point of my image to get the new point coordinates.

To do that, I created a custom Matrix3by3 class which contains an array of 9 floats.

To apply the matrix to each point, first I created this function:

constexpr auto apply_matrix(const Matrix3by3 & m, const Vec2i & p) -> Vec2f
{
  const auto x = m.at(0, 0) * p.x + m.at(0, 1) * p.y + m.at(0, 2);
  const auto y = m.at(1, 0) * p.x + m.at(1, 1) * p.y + m.at(1, 2);
  const auto z = m.at(2, 0) * p.x + m.at(2, 1) * p.y + m.at(2, 2);

  return { x / z, y / z };
}

As you can see, this function does a simple matrix multiplication that treats the point as (x, y, 1): the third column of the matrix is added without a multiplication, since there is no z value in my 2D images.

This works great, but since this part of the code is hot, I'm trying to optimize it, so I created an SSE version of it:

// Not constexpr: the SSE intrinsics used below are not usable in constant expressions.
inline auto apply_matrix(const Matrix3by3 & m, const Vec2i & p) -> Vec2f
{
  using SSEVec3 = union {
    struct
    {
      float z, y, x;

    };
    __m128 values_ = _mm_setzero_ps();
  };

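  // Each vector holds one matrix column. _mm_set_ps takes its arguments from
  // the highest lane to the lowest, so the lanes are (unused, x, y, z).
  const auto mvec1 = _mm_set_ps(0, m.at(0, 0), m.at(1, 0), m.at(2, 0));
  const auto mvec2 = _mm_set_ps(0, m.at(0, 1), m.at(1, 1), m.at(2, 1));
  const auto mvec3 = _mm_set_ps(0, m.at(0, 2), m.at(1, 2), m.at(2, 2));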

  const auto pvec1 = _mm_set1_ps(static_cast<float>(p.x));
  const auto pvec2 = _mm_set1_ps(static_cast<float>(p.y));

  auto result = SSEVec3{};
  result.values_ = _mm_add_ps(_mm_add_ps(_mm_mul_ps(mvec1, pvec1), _mm_mul_ps(mvec2, pvec2)), mvec3);

  return { result.x / result.z, result.y / result.z };
}

This works too, but it is slower than the first version, and since I'm in the process of learning SSE, I cannot see exactly why this is the case.

My idea with this second version was to do the x, y, and z value calculation in parallel.

So, that's my question: why is the SSE version slower, and how can I optimize it to be as fast as possible?

Thanks!

Answer 1

Generally, optimize only what needs optimizing, not what you guess needs it.

Probably the single worst point in the original code, which your 'optimization' did not address at all, is the duplicate division. Dividing floats or doubles is far more expensive than anything else in this code, so your best optimization is to compute 1/z once (a single division) into a helper variable and then multiply by it twice.
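A minimal sketch of that change, applied to the scalar version from the question. The Vec2i, Vec2f and Matrix3by3 definitions here are hypothetical stand-ins (the question does not show the real ones), and row-major storage is an assumption:

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for the question's types; the real definitions are not shown.
struct Vec2i { int x, y; };
struct Vec2f { float x, y; };

struct Matrix3by3
{
  std::array<float, 9> values;  // assumed row-major: element (row, col) is values[row * 3 + col]

  constexpr float at(std::size_t row, std::size_t col) const
  {
    return values[row * 3 + col];
  }
};

constexpr auto apply_matrix(const Matrix3by3 & m, const Vec2i & p) -> Vec2f
{
  const auto x = m.at(0, 0) * p.x + m.at(0, 1) * p.y + m.at(0, 2);
  const auto y = m.at(1, 0) * p.x + m.at(1, 1) * p.y + m.at(1, 2);
  const auto z = m.at(2, 0) * p.x + m.at(2, 1) * p.y + m.at(2, 2);

  // One division instead of two: compute the reciprocal once, then multiply.
  const auto inv_z = 1.0f / z;

  return { x * inv_z, y * inv_z };
}
```

Note that x * (1 / z) can differ from x / z in the last bit of the result, which usually does not matter for pixel coordinates. Whether this is actually faster in your build is something to measure rather than assume.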

But, as said at the beginning, you might not need any optimization, or you might need different ones. Test, profile, and look for the slowest piece of code. Guessing typically results in wasted effort and unnecessary code complexity.
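If no profiler is at hand, a rough timing harness is one way to start measuring. This is only a sketch: time_ms and sum_of_reciprocals are hypothetical names, and the workload is a stand-in for the real per-pixel loop, not the code from the question.

```cpp
#include <chrono>
#include <vector>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F && work)
{
  const auto start = std::chrono::steady_clock::now();
  work();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Stand-in workload: one division per element, like the per-point 1/z.
// Returning the sum gives the caller a result to use, so the loop is not dead code.
inline float sum_of_reciprocals(const std::vector<float> & zs)
{
  float sum = 0.0f;
  for (const float z : zs)
  {
    sum += 1.0f / z;
  }
  return sum;
}
```

To compare two variants, wrap each one in a lambda, for example `time_ms([&] { sum = sum_of_reciprocals(zs); })`, and run it over realistic image sizes in an optimized build. Repeat the measurement several times, since a single run is noisy.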

solution1
2 ACCEPTED 2017-08-05 00:40:32
