Why is this C++ code only slightly faster than Python?

Question

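I converted some Python code to C++ hoping for a performance gain, but found that the C++ implementation is only slightly faster. The code I am converting is the sos filter from the scipy library. My Python test is: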

from time import perf_counter

import numpy as np
from scipy import signal

a_wgt_coefs = [[0.2343006, 0.46860119, 0.2343006, 1., -0.22455845, 0.01260662],
               [1., -2., 1., 1., -1.89387049, 0.89515977, ],
               [1., -2., 1., 1., -1.99461446, 0.99462171]]

# Define input signal
fs = 48000
T = 1.0
N = int(fs * T)
t = np.linspace(0, T, N)

ip_signal = np.sin(2 * np.pi * 440 * t)

# Filter signal
num_runs = 1000


def main():
    durations = 0
    for n in range(num_runs):
        t_start = perf_counter()
        op_signal = signal.sosfilt(a_wgt_coefs, ip_signal)
        t_end = perf_counter()
        durations = durations + (t_end - t_start)
    avg_duration = durations / num_runs
    print(f'Average execution time = {avg_duration} seconds')


if __name__ == '__main__':
    main()

The C++ code is a line-for-line translation of scipy's _sosfilt function, which I implemented like this:

inline void sosfilt_cls_4(float sos[3][6], float x[SAMPLE_RATE]) {
    float x_n = 0, x_c = 0;
    float zi[3][2] = { 0 };
    // iterate over every sample i
    for (size_t i = 0; i < SAMPLE_RATE; ++i)
    {
        x_c = x[i];
        // iterate over every section j
        for (size_t j = 0; j < 3; ++j)
        {
            float* section = sos[j];
            float* zi_n = zi[j];
            x_n = section[0] * x_c + zi_n[0];
            zi_n[0] = section[1] * x_c - section[4] * x_n + zi_n[1];
            zi_n[1] = section[2] * x_c - section[5] * x_n;
            x_c = x_n;
        }
        x[i] = x_c;
    }
    return;
}

I benchmarked this using std::chrono:

float input_array[SAMPLE_RATE];
float sum = 0;

float sos_fs_48k_array_flt[3][6] = {{0.2343006f,   0.46860119f,  0.2343006f,   1.f, -0.22455845f,  0.01260662f},
 { 1.f, -2.f,          1.f,          1.f, -1.89387049f,  0.89515977f,},
 { 1.f, -2.f,          1.f,          1.f, -1.99461446f,  0.99462171f} };

int main()
{
    auto lin = linspace(0.0, 1.0, double(samples));

    std::cout << "Testing\n\n";
    for (int x = 0; x < runs; x++) {
        for (int x = 0; x < samples; x++) {
            input_col_vector(x, 0) = sin(2 * M_PI * 440 * lin.coeff(x, 0));
            input_array[x] = sin(2 * M_PI * 440 * lin.coeff(x, 0));
        }
        auto start = std::chrono::steady_clock::now();
        sosfilt_cls_4(sos_fs_48k_array_flt, input_array);
        auto stop = std::chrono::steady_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
        sum += elapsed.count();

    }
    std::cout << "Average time:\t" << (sum / runs)/ 1e6 << std::endl;
    std::cout << "End.\n";
    return 0;
}

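The Python code runs in 0.0002448 seconds, averaged over 1000 runs. By comparison, the C++ code runs in 0.0002336 seconds, averaged over 1000 runs.

I have set my compiler options in Visual Studio to favor speed over size and set the Ox flag. I am also using the fp:fast floating point model and the AVX2 enhanced instruction set.

Other things I did to improve speed were allocating all the input data on the stack and using floats instead of doubles, but they made no real difference.

Oddly, on a single run the C++ code is considerably faster, at 0.0002333 against 0.00045 for Python.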

Answer 1

The function you are replicating is implemented in Cython, see https://github.com/scipy/scipy/blob/v1.7.1/scipy/signal/_sosfilt.pyx .

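It is also specialized for the float case. So if it is carefully written (which I think it is), it may never have to call back into Python and is effectively pure C code, compiled to machine code, with a slightly different syntax from your code.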

Answer 2

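scipy/numpy basically call into optimized numerical libraries written in C/Fortran/C++.

The only extra overhead is converting to and from Python types, which is very fast using the CPython API.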

Answer 1
1 2022-01-03 18:51:28

Answer 2
0 2022-01-03 18:45:49
