
xtensor's "operator/" slower than numpy's "/"

I'm trying to transfer some code I've previously written in Python into C++, and I'm currently testing xtensor to see if it can be faster than numpy for doing what I need it to.

One of my functions takes a square matrix d and a scalar alpha, and performs the elementwise operation alpha/(alpha+d). Background: this function is used to test which value of alpha is 'best', so it is in a loop where d is always the same, but alpha varies.

All of the following timings are averaged over 100 runs of the function.

In numpy, it takes around 0.27 seconds to do this, and the code is as follows:

def kfun(d,alpha):
    # elementwise alpha / (alpha + d); returns a new array, d is left unchanged
    k = alpha / (d + alpha)
    return k

but xtensor takes about 0.36 seconds, and the code looks like this:

xt::xtensor<double,2> xk(xt::xtensor<double,2> d, double alpha){
    return alpha/(alpha+d);
}

I've also attempted the following version using std::vector, but this is something I do not want to use in the long run, even though it only took 0.22 seconds.

std::vector<std::vector<double>> kloops(std::vector<std::vector<double>> d, double alpha, int d_size){
    // d is passed by value, so the caller's matrix stays unchanged
    for (int i = 0; i<d_size; i++){
        for (int j = 0; j<d_size; j++){
            d[i][j] = alpha/(alpha + d[i][j]);
        }
    }
    return d;
}

I've noticed that operator/ in xtensor uses "lazy broadcasting"; is there maybe a way to make it immediate?
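For what it's worth, a minimal sketch of forcing eager evaluation (the function name xk_eager is just for illustration): assigning the lazy expression to a concrete container, which the return statement above effectively does as well, materializes it immediately.

#include <xtensor/xtensor.hpp>

xt::xtensor<double,2> xk_eager(const xt::xtensor<double,2>& d, double alpha){
    // assigning the lazy expression to a concrete container evaluates it right away;
    // xt::eval() from <xtensor/xeval.hpp> can be used for the same purpose
    xt::xtensor<double,2> k = alpha/(alpha + d);
    return k;
}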

EDIT:

In Python, the function is called as follows and timed using the "time" module:

t0 = time.time()
for i in range(100):
    kk = kfun(dsquared, alpha_squared)
print(time.time()-t0)

In C++, I call the function as follows and time it using std::chrono:

//d is saved as a 1D npy file, an artefact from old code
auto sd2 = xt::load_npy<double>("/path/to/d.npy");

// reshape the flat array into a 7084x7084 matrix
xt::xtensor<double, 2>::shape_type shape = {7084, 7084};
xt::xtensor<double, 2> xd2(shape);
for (int i = 0; i < 7084; i++){
    for (int j = 0; j < 7084; j++){
        xd2(i, j) = sd2(i * 7084 + j);
    }
}

auto start = std::chrono::steady_clock::now();
for (int i = 0; i < 10; i++){
    xt::xtensor<double, 2> kk = xk(xd2, 4000.0 * 4000.0);
}
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "k takes: " << elapsed_seconds.count() << "\n";

If you wish to run this code, I'd suggest using xd2 as a symmetric 7084x7084 random matrix with zeros on the diagonal.

The output of the function, a matrix called k, then goes on to be used in other functions, but I still need d to be unchanged as it will be reused later.

END EDIT

To run my C++ code I use the following line in the terminal:

cd "/path/to/src/" && g++ -mavx2 -ffast-math -DXTENSOR_USE_XSIMD -O3 ccode.cpp -o ccode -I/path/to/xtensorinclude && "/path/to/src/"ccode

Thanks in advance!

A problem with the C++ implementation may be that it creates one or possibly even two temporary copies that could be avoided. The first copy comes from not passing the argument by reference (or by perfect forwarding). Without seeing the rest of the code, it's hard to judge whether this has an impact on performance. The compiler may move d into the method if it is guaranteed not to be used after the call to xk(), but it is more likely to copy the data into d.

To pass by reference, the method could be changed to

xt::xtensor<double,2> xk(const xt::xtensor<double,2>& d, double alpha){
    return alpha/(alpha+d);
}

To use perfect forwarding (and also enable other xtensor containers like xt::xarray or xt::xtensor_fixed), the method could be changed to

template<typename T>
xt::xtensor<double,2> xk(T&& d, double alpha){
    return alpha/(alpha+d);
}

Furthermore, it's possible that you can avoid allocating memory for the return value on every call. Again, it's hard to judge without seeing the rest of the code. But if the method is used inside a loop, and the return value always has the same shape, then it can be beneficial to create the return value outside of the loop and return it by reference. To do this, the method could be changed to:

template<typename T, typename U>
void xk(T& r, U&& d, double alpha){
    r = alpha/(alpha+d);
}
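The loop over alpha values could then allocate the result once and reuse it. A sketch only; the container name k and the range alphas are placeholders, not taken from the original code:

xt::xtensor<double,2> k(d.shape());   // allocated once, same shape as d
for (double alpha : alphas){          // 'alphas' stands for the values being tested
    xk(k, d, alpha);                  // writes into k, no new allocation per iteration
    // ... use k here, d stays unchanged ...
}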

If it is guaranteed that d and r do not point to the same memory, you can further wrap r in xt::noalias() to avoid a temporary copy before assigning the result. The same holds for the return value of the function in case you do not return by reference.
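A sketch of the same method with the assignment wrapped in xt::noalias() (assuming r and d never overlap; xt::noalias comes from <xtensor/xnoalias.hpp>):

template<typename T, typename U>
void xk(T& r, U&& d, double alpha){
    // noalias tells xtensor the left-hand side does not overlap the right-hand side,
    // so the result is assigned directly without an intermediate temporary
    xt::noalias(r) = alpha/(alpha + d);
}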

Good luck and happy coding!
