Abstraction vs. performance when working with arrays

Question

This question is about choosing between better performance and clearer code (better abstraction) when dealing with arrays. I tried to distill it down to a toy example.

C++ is particularly good at allowing abstractions without hurting performance. The question is whether this is possible in examples similar to the one below.

Consider a trivial arbitrary-size matrix class that uses contiguous row-major storage:

#include <cmath>
#include <cassert>

class Matrix {
    int nrow, ncol;
    double *data;
public:
    Matrix(int nrow, int ncol) : nrow(nrow), ncol(ncol), data(new double[nrow*ncol]) { }
    ~Matrix() { delete [] data; }

    int rows() const { return nrow; }
    int cols() const { return ncol; }

    double & operator [] (int i) { return data[i]; }

    double & operator () (int i, int j) { return data[i*ncol + j]; }
};

It has a 2D indexing operator () to make it easy to work with. It also has operator [] for contiguous access, but a better-abstracted matrix may not have this.

Let's implement a function that takes an n-by-2 matrix, essentially a list of 2D vectors, and normalizes each vector in-place.

The clear way:

inline double veclen(double x, double y) {
    return std::sqrt(x*x + y*y);
}

void normalize(Matrix &mat) {
    assert(mat.cols() == 2); // some kind of check for correct input
    for (int i=0; i < mat.rows(); ++i) {
        double norm = veclen(mat(i,0), mat(i,1));
        mat(i,0) /= norm;
        mat(i,1) /= norm;
    }
}

The fast, but less clear way:

void normalize2(Matrix &mat) {
    assert(mat.cols() == 2);
    for (int i=0; i < mat.rows(); ++i) {
        double norm = veclen(mat[2*i], mat[2*i+1]);
        mat[2*i] /= norm;
        mat[2*i+1] /= norm;
    }
}

The second version (normalize2) has the potential to be faster because it is written so that it is clear that one loop iteration does not access data written by the previous iteration. Thus it can potentially make better use of SIMD instructions. Looking at godbolt, this seems to be what happens (unless I'm misreading the assembly).

In the first version (normalize), the compiler can't know that the input matrix is not n-by-1, which would lead to overlapping array accesses.
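To make the overlap concrete, here is a small sketch that uses the same row-major index formula as operator (). The helper name flat_index is hypothetical and exists only for this illustration; it is not part of the Matrix class above.

```cpp
#include <cassert>

// Same row-major formula as Matrix::operator(): element (i, j) is stored at i*ncol + j.
// flat_index is a hypothetical helper introduced only for this illustration.
constexpr int flat_index(int i, int j, int ncol) {
    return i * ncol + j;
}

// With ncol == 1, "column 1" of row i is the same slot as column 0 of row i+1,
// so a write in iteration i could feed a read in iteration i+1.
static_assert(flat_index(0, 1, 1) == flat_index(1, 0, 1),
              "rows overlap when ncol == 1");

// With ncol == 2, every row owns its own pair of slots, so iterations are independent.
static_assert(flat_index(0, 1, 2) != flat_index(1, 0, 2),
              "rows are disjoint when ncol == 2");
```

Since ncol is only known at run time in normalize(), the compiler has to allow for the first case unless something in the function rules it out.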

Question: Is it possible to somehow tell the compiler that the input matrix is really n-by-2 in normalize() to allow it to optimize to the same level as it does in normalize2()?

Addressing the comments:

John Zwinck: I went and did the benchmark. normalize2() is considerably faster (1.3 seconds versus 2.4 seconds for normalize()), but only if I remove the assert macros or define NDEBUG. That is a rather counterintuitive effect of -DNDEBUG, isn't it? It reduces performance instead of improving it.
Max: The evidence is both the godbolt output I linked to and the above benchmark. I am also interested in the case when these two functions cannot be inlined (e.g. because they are in a separate translation unit).
Jarod42 and bolov: This is the answer I was looking for. It is confirmed by the benchmark mentioned in the first point. Still, this is important to know in case one implements one's own assert (which is exactly what I do in my application).

Answer 1

I believe templates let you achieve both performance and readability.

By making the matrix dimensions template parameters (as popular math libraries do), you give the compiler a lot of information at compile time.

I modified your little class a bit:

#include <cmath>

template<int R, int C>
class Matrix {
    double data[R * C] = {0.0};
public:
    Matrix() = default;

    int rows() const { return R; }
    int cols() const { return C; }
    int size() const { return R*C; }

    double & operator [] (int i) { return data[i]; }

    double & operator () (int row, int col) { return data[row*C + col]; }
};

inline double veclen(double x, double y) {
    return std::sqrt(x*x + y*y);
}

template<int R>
void normalize(Matrix<R, 2> &mat) {
    for (int i = 0; i < R; ++i) {
        double norm = veclen(mat(i, 0), mat(i, 1));
        mat(i, 0) /= norm;
        mat(i, 1) /= norm;
    }
}

template<int R>
void normalize2(Matrix<R, 2> &mat) {
    for (int i = 0; i < R; ++i) {
        double norm = veclen(mat[2 * i], mat[2 * i + 1]);
        mat[2 * i] /= norm;
        mat[2 * i + 1] /= norm;
    }
}

I also prefer to store the data as a plain member (i.e. not behind a pointer), so you can choose where the memory lives (stack or heap) when you construct the matrix.
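As a rough sketch of that choice, the same class template can be placed in automatic storage or on the heap by the caller. The function names stack_size and heap_size are hypothetical, the class is trimmed to what the example needs, and std::make_unique requires C++14.

```cpp
#include <cassert>
#include <memory>

// Trimmed copy of the templated Matrix: the array is a plain member of the object.
template<int R, int C>
class Matrix {
    double data[R * C] = {0.0};
public:
    int size() const { return R*C; }
    double & operator [] (int i) { return data[i]; }
};

// Automatic storage: the whole array lives inside the local object.
// (hypothetical helper name)
int stack_size() {
    Matrix<4, 2> m;
    return m.size();
}

// Dynamic storage: the same type, but the caller decides to allocate it on the heap,
// which is the safer choice for large R.
// (hypothetical helper name)
int heap_size() {
    auto m = std::make_unique<Matrix<1000, 2>>();
    assert((*m)[0] == 0.0);   // elements start out zero-initialized
    return m->size();
}
```

The class itself does not change between the two cases; only the construction site does.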

A nice extra is that you are now sure at compile time that the normalize functions only accept n-by-2 matrices.
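A minimal usage sketch of that compile-time check, assuming a trimmed copy of the templated class and normalize() from above (veclen is inlined here to keep the example short). The function demo is a hypothetical helper that exists only for this example.

```cpp
#include <cassert>
#include <cmath>

// Trimmed copy of the templated Matrix from above.
template<int R, int C>
class Matrix {
    double data[R * C] = {0.0};
public:
    double & operator () (int row, int col) { return data[row*C + col]; }
};

// Only matrices with exactly 2 columns match this signature; R is deduced.
template<int R>
void normalize(Matrix<R, 2> &mat) {
    for (int i = 0; i < R; ++i) {
        double norm = std::sqrt(mat(i, 0)*mat(i, 0) + mat(i, 1)*mat(i, 1));
        mat(i, 0) /= norm;
        mat(i, 1) /= norm;
    }
}

// Hypothetical helper: normalizes the rows (3, 4) and (0, 2)
// and returns element (row, col) of the result.
double demo(int row, int col) {
    assert(row >= 0 && row < 2 && col >= 0 && col < 2);
    Matrix<2, 2> m;
    m(0, 0) = 3.0; m(0, 1) = 4.0;
    m(1, 0) = 0.0; m(1, 1) = 2.0;
    normalize(m);   // OK: R is deduced as 2
    // Matrix<2, 3> bad;
    // normalize(bad);   // would not compile: no matching function
    return m(row, col);
}
```

The price is that the number of rows must also be known at compile time, which the original run-time-sized class does not require.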

I didn't test my code on Compiler Explorer, because honestly I can't decipher asm. So, yes, I claim my version is better without being sure ;)

A last word: don't roll your own matrix, use a library such as GLM or Eigen.

A last word²: If you don't know what to choose, prefer readability.

Answer 2

An answer that is acceptable to me was essentially given by @bolov and @Jarod42 in the comments. Since they did not post it, I will do so myself.

To let the compiler know that the matrix is of size n-by-2, it is sufficient to add code at the beginning of the function that makes the rest of the function unreachable when the matrix size is not correct.

For example, adding

if (mat.cols() != 2)
    throw std::runtime_error("Input array is not of expected shape.");

to the beginning of normalize() allows it to run exactly as fast as normalize2() (1.3 instead of 2.4 seconds in my benchmark with clang 5.0).

We can also add an assert(mat.cols() == 2), but this has the counterintuitive effect that defining -DNDEBUG during compilation makes the function considerably slower (as it removes the assertion).
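For a hand-rolled assert, one possible way to keep the hint when checks are compiled out is to mark the failing branch as unreachable. The sketch below is an assumption, not something benchmarked in this thread: ASSUME is a hypothetical macro name, __builtin_unreachable() is a GCC/Clang builtin, and on other compilers the macro expands to nothing. Note that if the condition is ever false at run time, the behavior is undefined, so this trades safety for speed.

```cpp
#include <cassert>
#include <cmath>

// ASSUME is a hypothetical macro name, not a standard facility.
// On GCC/Clang it tells the optimizer that the condition always holds;
// violating it at run time is undefined behavior.
#if defined(__GNUC__) || defined(__clang__)
#  define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#  define ASSUME(cond) ((void)0)   // no hint available in this sketch
#endif

// The run-time-sized Matrix from the question, zero-initialized and non-copyable.
class Matrix {
    int nrow, ncol;
    double *data;
public:
    Matrix(int nrow, int ncol) : nrow(nrow), ncol(ncol), data(new double[nrow*ncol]()) { }
    ~Matrix() { delete [] data; }
    Matrix(const Matrix &) = delete;
    Matrix & operator = (const Matrix &) = delete;

    int rows() const { return nrow; }
    int cols() const { return ncol; }
    double & operator () (int i, int j) { return data[i*ncol + j]; }
};

void normalize(Matrix &mat) {
    assert(mat.cols() == 2);   // checked in debug builds, removed by -DNDEBUG
    ASSUME(mat.cols() == 2);   // remains visible to the optimizer either way
    for (int i = 0; i < mat.rows(); ++i) {
        double norm = std::sqrt(mat(i, 0)*mat(i, 0) + mat(i, 1)*mat(i, 1));
        mat(i, 0) /= norm;
        mat(i, 1) /= norm;
    }
}

// Hypothetical helper: normalizes the single row (3, 4) and returns its second component.
double demo() {
    Matrix m(1, 2);
    m(0, 0) = 3.0;
    m(0, 1) = 4.0;
    normalize(m);
    return m(0, 1);
}
```

Whether this recovers the full speedup should be confirmed by rerunning the benchmark or inspecting the generated assembly; the throwing check above is the variant that was actually measured.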

2 answers

Answer 1
1 2017-11-08 14:21:27

Answer 2
1