
How is numpy so fast?

I'm trying to understand how numpy can be so fast, based on my shocking comparison with optimized C/C++ code which is still far from reproducing numpy's speed.

Consider the following example: given a 2D array with shape=(N, N) and dtype=float32, which represents a list of N vectors of N dimensions, I am computing the pairwise differences between every pair of vectors. Using numpy broadcasting, this simply writes as:

def pairwise_sub_numpy( X ):
    return X - X[:, None, :]

Using timeit I can measure the performance for N=512: it takes 88 ms per call on my laptop.

Now, in C/C++ a naive implementation writes as:

#define X(i, j)     _X[(i)*N + (j)]
#define res(i, j, k)  _res[((i)*N + (j))*N + (k)]

float* pairwise_sub_naive( const float* _X, int N ) 
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++)
                res(i,j,k) = X(i,k) - X(j,k);
          }
    }
    return _res;
}

Compiling using gcc 7.3.0 with the -O3 flag, I get 195 ms per call for pairwise_sub_naive(X), which is not too bad given the simplicity of the code, but about 2 times slower than numpy.

Now I start getting serious and add some small optimizations, by indexing the row vectors directly:

float* pairwise_sub_better( const float* _X, int N ) 
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));
    
    for (int i = 0; i < N; i++) {
        const float* xi = & X(i,0);

        for (int j = 0; j < N; j++) {
            const float* xj = & X(j,0);
            
            float* r = &res(i,j,0);
            for (int k = 0; k < N; k++)
                 r[k] = xi[k] - xj[k];
        }
    }
    return _res;
}

The speed stays the same at 195 ms, which means that the compiler had already figured that much out on its own. Let's now use SIMD vector instructions:

float* pairwise_sub_simd( const float* _X, int N ) 
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));

    // create caches for row vectors which are memory-aligned
    float* xi = (float*)aligned_alloc(32, N * sizeof(float));
    float* xj = (float*)aligned_alloc(32, N * sizeof(float));
    
    for (int i = 0; i < N; i++) {
        memcpy(xi, & X(i,0), N*sizeof(float));
        
        for (int j = 0; j < N; j++) {
            memcpy(xj, & X(j,0), N*sizeof(float));
            
            float* r = &res(i,j,0);
            for (int k = 0; k < N; k += 256/sizeof(float)) {
                const __m256 A = _mm256_load_ps(xi+k);
                const __m256 B = _mm256_load_ps(xj+k);
                _mm256_store_ps(r+k, _mm256_sub_ps( A, B ));
            }
        }
    }
    free(xi); 
    free(xj);
    return _res;
}

This only yields a small boost (178 ms instead of 194 ms per function call).

Then I was wondering if a "block-wise" approach, like what is used to optimize dot-products, could be beneficial:

float* pairwise_sub_blocks( const float* _X, int N ) 
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));

    #define B 8
    float cache1[B*B], cache2[B*B];

    for (int bi = 0; bi < N; bi+=B)
      for (int bj = 0; bj < N; bj+=B)
        for (int bk = 0; bk < N; bk+=B) {
        
            // load first 8x8 block in the cache
            for (int i = 0; i < B; i++)
              for (int k = 0; k < B; k++)
                cache1[B*i + k] = X(bi+i, bk+k);

            // load second 8x8 block in the cache
            for (int j = 0; j < B; j++)
              for (int k = 0; k < B; k++)
                cache2[B*j + k] = X(bj+j, bk+k);

            // compute local operations on the caches
            for (int i = 0; i < B; i++)
             for (int j = 0; j < B; j++)
              for (int k = 0; k < B; k++)
                 res(bi+i,bj+j,bk+k) = cache1[B*i + k] - cache2[B*j + k];
         }
    return _res;
}

And surprisingly, this is the slowest method so far (258 ms per function call).

To summarize, despite some efforts with optimized C++ code, I can't come anywhere close to the 88 ms/call that numpy achieves effortlessly. Any idea why?

Note: By the way, I am disabling numpy multi-threading, and in any case this kind of operation is not multi-threaded.

Edit: Exact code to benchmark the numpy code:

import numpy as np

def pairwise_sub_numpy( X ):
    return X - X[:, None, :]

N = 512
X = np.random.rand(N,N).astype(np.float32)

import timeit
times = timeit.repeat('pairwise_sub_numpy( X )', globals=globals(), number=1, repeat=5)
print(f">> best of 5 = {1000*min(times):.3f} ms")

Full benchmark for the C code:

#include <stdio.h>
#include <stdlib.h>      // aligned_alloc, malloc, free
#include <string.h>
#include <xmmintrin.h>   // compile with -mavx -msse4.1
#include <pmmintrin.h>
#include <immintrin.h>
#include <time.h>


#define X(i, j)     _x[(i)*N + (j)]
#define res(i, j, k)  _res[((i)*N + (j))*N + (k)]

float* pairwise_sub_naive( const float* _x, int N ) 
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++)
                res(i,j,k) = X(i,k) - X(j,k);
          }
    }
    return _res;
}

float* pairwise_sub_better( const float* _x, int N ) 
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));
    
    for (int i = 0; i < N; i++) {
        const float* xi = & X(i,0);

        for (int j = 0; j < N; j++) {
            const float* xj = & X(j,0);
            
            float* r = &res(i,j,0);
            for (int k = 0; k < N; k++)
                 r[k] = xi[k] - xj[k];
        }
    }
    return _res;
}

float* pairwise_sub_simd( const float* _x, int N ) 
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));

    // create caches for row vectors which are memory-aligned
    float* xi = (float*)aligned_alloc(32, N * sizeof(float));
    float* xj = (float*)aligned_alloc(32, N * sizeof(float));
    
    for (int i = 0; i < N; i++) {
        memcpy(xi, & X(i,0), N*sizeof(float));
        
        for (int j = 0; j < N; j++) {
            memcpy(xj, & X(j,0), N*sizeof(float));
            
            float* r = &res(i,j,0);
            for (int k = 0; k < N; k += 256/sizeof(float)) {
                const __m256 A = _mm256_load_ps(xi+k);
                const __m256 B = _mm256_load_ps(xj+k);
                _mm256_store_ps(r+k, _mm256_sub_ps( A, B ));
            }
        }
    }
    free(xi); 
    free(xj);
    return _res;
}


float* pairwise_sub_blocks( const float* _x, int N ) 
{
    float* _res = (float*) aligned_alloc( 32, N*N*N*sizeof(float));

    #define B 8
    float cache1[B*B], cache2[B*B];

    for (int bi = 0; bi < N; bi+=B)
      for (int bj = 0; bj < N; bj+=B)
        for (int bk = 0; bk < N; bk+=B) {
        
            // load first 8x8 block in the cache
            for (int i = 0; i < B; i++)
              for (int k = 0; k < B; k++)
                cache1[B*i + k] = X(bi+i, bk+k);

            // load second 8x8 block in the cache
            for (int j = 0; j < B; j++)
              for (int k = 0; k < B; k++)
                cache2[B*j + k] = X(bj+j, bk+k);

            // compute local operations on the caches
            for (int i = 0; i < B; i++)
             for (int j = 0; j < B; j++)
              for (int k = 0; k < B; k++)
                 res(bi+i,bj+j,bk+k) = cache1[B*i + k] - cache2[B*j + k];
         }
    return _res;
}

int main() 
{
    const int N = 512;
    float* _x = (float*) malloc( N * N * sizeof(float) );
    for( int i = 0; i < N; i++)
      for( int j = 0; j < N; j++)
        X(i,j) = ((i+j*j+17*i+101) % N) / (float)N;

    double best = 9e9;
    for( int i = 0; i < 5; i++)
    {
        struct timespec start, stop;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
        
        //float* res = pairwise_sub_naive( _x, N );
        //float* res = pairwise_sub_better( _x, N );
        //float* res = pairwise_sub_simd( _x, N );
        float* res = pairwise_sub_blocks( _x, N );

        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &stop);

        double t = (stop.tv_sec - start.tv_sec) * 1e6 + (stop.tv_nsec - start.tv_nsec) / 1e3;    // in microseconds
        if (t < best) best = t;
        free( res );
    }
    printf("Best of 5 = %f ms\n", best / 1000);

    free( _x );
    return 0;
}

Compiled using gcc 7.3.0: gcc -Wall -O3 -mavx -msse4.1 -o test_simd test_simd.c

Summary of timings on my machine:

Implementation             Time
numpy                      88 ms
C++ naive                  194 ms
C++ better                 195 ms
C++ SIMD                   178 ms
C++ blocked                258 ms
C++ blocked (gcc 8.3.1)    217 ms

As pointed out by some of the comments, numpy uses SIMD in its implementation and it does not allocate memory at the point of computation. If I eliminate the memory allocation from your implementation, pre-allocating all the buffers ahead of the computation, then I get a better time than numpy even with the scalar version (that is, the one without any optimizations).

Also, regarding SIMD and why your implementation does not perform much better than the scalar one: your memory access patterns are not ideal for SIMD usage. You do a memcpy and you load into SIMD registers from locations that are far apart from each other - e.g. you fill vectors from line 0 and line 511, which might not play well with the cache or with the prefetcher.

There is also a mistake in how you load the SIMD registers (if I understood correctly what you're trying to compute): a 256-bit SIMD register can hold 8 single-precision floating-point numbers (8 * 32 = 256 bits), but in your loop you advance k by "256/sizeof(float)", which is 256/4 = 64. _x and _res are float pointers, and the SIMD intrinsics also expect float pointers as arguments, so instead of reading all elements from those lines every 8 floats, you read them every 64 floats.
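
For illustration (reusing the xi, xj, and r pointers from the question's pairwise_sub_simd, and assuming N is a multiple of 8), the corrected inner loop simply steps k by 8:

// a 256-bit register holds 8 floats, so advance 8 floats per iteration
for (int k = 0; k < N; k += 8) {
    const __m256 A = _mm256_load_ps(xi + k);      // 8 floats from row i
    const __m256 B = _mm256_load_ps(xj + k);      // 8 floats from row j
    _mm256_store_ps(r + k, _mm256_sub_ps(A, B));  // store 8 differences at once
}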

The computation can be optimized further by changing the access patterns, but also by observing that you repeat some computations: e.g. when iterating with line0 as a base you compute line0 - line1, but at some future time, when iterating with line1 as a base, you need to compute line1 - line0, which is basically -(line0 - line1). That is, for each line after line0, a lot of results could be reused from previous computations. A lot of the time, SIMD usage or parallelization requires changing how data is accessed or reasoned about in order to provide meaningful improvements.
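
As a minimal sketch of that reuse (my own illustration with a hypothetical function name, not one of the benchmarked versions below): compute each pair only for j > i, write the negated result into the mirrored block, and leave the diagonal blocks at zero.

#include <string.h>   // memset
#include <stddef.h>   // size_t

// Sketch: exploit res(j,i,k) == -res(i,j,k) so each pair is subtracted only once.
void pairwise_sub_antisym(const float* x, float* out, int n)
{
    for (int i = 0; i < n; i++) {
        // the (i,i) block is identically zero
        memset(out + ((size_t)i * n + i) * n, 0, n * sizeof(float));

        for (int j = i + 1; j < n; j++) {
            const float* xi  = x + (size_t)i * n;
            const float* xj  = x + (size_t)j * n;
            float*       rij = out + ((size_t)i * n + j) * n;
            float*       rji = out + ((size_t)j * n + i) * n;
            for (int k = 0; k < n; k++) {
                const float d = xi[k] - xj[k];
                rij[k] =  d;
                rji[k] = -d;   // line_j - line_i = -(line_i - line_j)
            }
        }
    }
}

Note that this still issues the same number of stores, so it mainly halves the loads and subtractions; whether it pays off depends on how memory-bound the loop is.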

Here is what I have done as a first step, based on your initial implementation, and it is faster than numpy (don't mind the OpenMP part; that's not how it's supposed to be done, I just wanted to see how the naive approach behaves):

C++
Time scaler version: 55 ms
Time SIMD version: 53 ms
**Time SIMD 2 version: 33 ms**
Time SIMD 3 version: 168 ms
Time OpenMP version: 59 ms

Python numpy
>> best of 5 = 88.794 ms


#include <cstdlib>
#include <xmmintrin.h>   // compile with -mavx -msse4.1
#include <pmmintrin.h>
#include <immintrin.h>

#include <numeric>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <cstring>

using namespace std;

float* pairwise_sub_naive (const float* input, float* output, int n) 
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++)
                output[(i * n + j) * n + k] = input[i * n + k] - input[j * n + k];
          }
    }
    return output;
}

float* pairwise_sub_simd (const float* input, float* output, int n) 
{    
    for (int i = 0; i < n; i++) 
    {
        const int idxi = i * n;
        for (int j = 0; j < n; j++)
        {
            const int idxj = j * n;
            const int outidx = idxi + j;
            for (int k = 0; k < n; k += 8) 
            {
                __m256 A = _mm256_load_ps(input + idxi + k);
                __m256 B = _mm256_load_ps(input + idxj + k);
                _mm256_store_ps(output + outidx * n + k, _mm256_sub_ps( A, B ));
            }
        }
    }
    
    return output;
}

float* pairwise_sub_simd_2 (const float* input, float* output, int n) 
{
    float* line_buffer = (float*) aligned_alloc(32, n * sizeof(float));

    for (int i = 0; i < n; i++) 
    {
        const int idxi = i * n;
        for (int j = 0; j < n; j++)
        {
            const int idxj = j * n;
            const int outidx = idxi + j;
            for (int k = 0; k < n; k += 8) 
            {
                __m256 A = _mm256_load_ps(input + idxi + k);
                __m256 B = _mm256_load_ps(input + idxj + k);
                _mm256_store_ps(line_buffer + k, _mm256_sub_ps( A, B ));
            }
            memcpy(output + outidx * n, line_buffer, n * sizeof(float));
        }
    }
    
    return output;
}

float* pairwise_sub_simd_3 (const float* input, float* output, int n) 
{    
    for (int i = 0; i < n; i++) 
    {
        const int idxi = i * n;
        for (int k = 0; k < n; k += 8) 
        {
            __m256 A = _mm256_load_ps(input + idxi + k);
            for (int j = 0; j < n; j++)
            {
                const int idxj = j * n;
                const int outidx = (idxi + j) * n;
                __m256 B = _mm256_load_ps(input + idxj + k);
                _mm256_store_ps(output + outidx + k, _mm256_sub_ps( A, B     ));
             }
        }
    }

    return output;
}

float* pairwise_sub_openmp (const float* input, float* output, int n)
{
    int i, j;
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++) 
    {
        for (j = 0; j < n; j++)
        {
            const int idxi = i * n; 
            const int idxj = j * n;
            const int outidx = idxi + j;
            for (int k = 0; k < n; k += 8) 
            {
                __m256 A = _mm256_load_ps(input + idxi + k);
                __m256 B = _mm256_load_ps(input + idxj + k);
                _mm256_store_ps(output + outidx * n + k, _mm256_sub_ps( A, B ));
            }
        }
    }
    /*for (i = 0; i < n; i++) 
    {
        for (j = 0; j < n; j++) 
        {
            for (int k = 0; k < n; k++)
            {
                output[(i * n + j) * n + k] = input[i * n + k] - input[j * n + k];
            }
        }
    }*/
    
    return output;
}

int main ()
{
    constexpr size_t n = 512;
    constexpr size_t input_size = n * n;
    constexpr size_t output_size = n * n * n;

    float* input = (float*) aligned_alloc(32, input_size * sizeof(float));
    float* output = (float*) aligned_alloc(32, output_size * sizeof(float));

    float* input_simd = (float*) aligned_alloc(32, input_size * sizeof(float));
    float* output_simd = (float*) aligned_alloc(32, output_size * sizeof(float));

    float* input_par = (float*) aligned_alloc(32, input_size * sizeof(float));
    float* output_par = (float*) aligned_alloc(32, output_size * sizeof(float));

    iota(input, input + input_size, float(0.0));
    fill(output, output + output_size, float(0.0));

    iota(input_simd, input_simd + input_size, float(0.0));
    fill(output_simd, output_simd + output_size, float(0.0));
    
    iota(input_par, input_par + input_size, float(0.0));
    fill(output_par, output_par + output_size, float(0.0));

    std::chrono::milliseconds best_scaler{100000};
    for (int i = 0; i < 5; ++i)
    {
        auto start = chrono::high_resolution_clock::now();
        pairwise_sub_naive(input, output, n);
        auto stop = chrono::high_resolution_clock::now();

        auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
        if (duration < best_scaler)
        {
            best_scaler = duration;
        }
    }
    cout << "Time scaler version: " << best_scaler.count() << " ms\n";

    std::chrono::milliseconds best_simd{100000};
    for (int i = 0; i < 5; ++i)
    {
        auto start = chrono::high_resolution_clock::now();
        pairwise_sub_simd(input_simd, output_simd, n);
        auto stop = chrono::high_resolution_clock::now();

        auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
        if (duration < best_simd)
        {
            best_simd = duration;
        }
    }
    cout << "Time SIMD version: " << best_simd.count() << " ms\n";

    std::chrono::milliseconds best_simd_2{100000};
    for (int i = 0; i < 5; ++i)
    {
        auto start = chrono::high_resolution_clock::now();
        pairwise_sub_simd_2(input_simd, output_simd, n);
        auto stop = chrono::high_resolution_clock::now();

        auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
        if (duration < best_simd_2)
        {
            best_simd_2 = duration;
        }
    }
    cout << "Time SIMD 2 version: " << best_simd_2.count() << " ms\n";

    std::chrono::milliseconds best_simd_3{100000};
    for (int i = 0; i < 5; ++i)
    {
        auto start = chrono::high_resolution_clock::now();
        pairwise_sub_simd_3(input_simd, output_simd, n);
        auto stop = chrono::high_resolution_clock::now();

        auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
        if (duration < best_simd_3)
        {
            best_simd_3 = duration;
        }
    }
    cout << "Time SIMD 3 version: " << best_simd_3.count() << " ms\n";

    std::chrono::milliseconds best_par{100000};
    for (int i = 0; i < 5; ++i)
    {
        auto start = chrono::high_resolution_clock::now();
        pairwise_sub_openmp(input_par, output_par, n);
        auto stop = chrono::high_resolution_clock::now();

        auto duration = chrono::duration_cast<chrono::milliseconds>(stop - start);
         if (duration < best_par)
        {
            best_par = duration;
        }
    }
    cout << "Time OpenMP version: " << best_par.count() << " ms\n";

    cout << "Verification\n";
    if (equal(output, output + output_size, output_simd))
    {
        cout << "PASSED\n";
    }
    else
    {
        cout << "FAILED\n";
    }

    return 0;
}

Edit: Small correction, as there was a wrong call related to the second version of the SIMD implementation.

As you can see now, the second implementation is the fastest, as it behaves best from the point of view of cache locality of reference. Examples 2 and 3 of the SIMD implementations are there to illustrate how changing memory access patterns influences the performance of your SIMD optimizations. To summarize (knowing that I'm far from complete in my advice): be mindful of your memory access patterns and of the loads and stores to and from the SIMD unit. The SIMD unit is a separate hardware unit inside the processor's core, so there is a penalty in shuffling data back and forth; hence, when you load a register from memory, try to do as many operations as possible with that data and do not be too eager to store it back (of course, in your example that might be all you need to do with the data). Be mindful also that there is a limited number of SIMD registers available, and if you load too many then they will "spill", that is, they will be stored back to temporary locations in main memory behind the scenes, killing all your gains. SIMD optimization is a true balancing act!

There is some effort to put a cross-platform intrinsics wrapper into the standard (I developed a closed-source one myself in my glorious past), and even though it's far from complete, it's worth taking a look at (read the accompanying papers if you're truly interested in learning how SIMD works): https://github.com/VcDevel/std-simd

This is a complement to the answer posted by @celakev. I think I finally got to understand what exactly the issue was. The issue was not about allocating the memory in the main function that does the computation.

What actually takes time is accessing new (fresh) memory. I believe that the malloc call returns pages of memory that are virtual, i.e. that do not correspond to actual physical memory, until that memory is explicitly accessed. What actually takes time is the process of allocating physical memory on the fly (which I think is done at the OS level) when it is accessed in the function code.

Here is a proof. Consider the two following trivial functions:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

float* just_alloc( size_t N ) 
{
    return (float*) aligned_alloc( 32, sizeof(float)*N );
}

void just_fill( float* _arr, size_t N ) 
{
    for (size_t i = 0; i < N; i++)
        _arr[i] = 1;
}

#define Time( code_to_benchmark, cleanup_code ) \
    do { \
        double best = 9e9; \
        for( int i = 0; i < 5; i++) { \
            struct timespec start, stop; \
            clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start); \
            code_to_benchmark; \
            clock_gettime(CLOCK_THREAD_CPUTIME_ID, &stop); \
            double t = (stop.tv_sec - start.tv_sec) * 1e3 + (stop.tv_nsec - start.tv_nsec) / 1e6; \
            printf("Time[%d] = %f ms\n", i, t); \
            if (t < best) best = t; \
            cleanup_code; \
        } \
        printf("Best of 5 for '" #code_to_benchmark "' = %f ms\n\n", best); \
    } while(0)

int main() 
{
    const size_t N = 512;

    Time( float* arr = just_alloc(N*N*N), free(arr) );
    
    float* arr = just_alloc(N*N*N);
    Time( just_fill(arr, N*N*N), ; );
    free(arr);

    return 0;
}

I get the following timings, which I now detail for each of the calls:

Time[0] = 0.000931 ms
Time[1] = 0.000540 ms
Time[2] = 0.000523 ms
Time[3] = 0.000524 ms
Time[4] = 0.000521 ms
Best of 5 for 'float* arr = just_alloc(N*N*N)' = 0.000521 ms

Time[0] = 189.822237 ms
Time[1] = 45.041083 ms
Time[2] = 46.331428 ms
Time[3] = 44.729433 ms
Time[4] = 42.241279 ms
Best of 5 for 'just_fill(arr, N*N*N)' = 42.241279 ms

As you can see, allocating memory is blazingly fast, but the first time the memory is accessed, it is about 5 times slower than the other times. So, basically, the reason my code was slow is that each time I was reallocating fresh memory that had no physical address yet. (Correct me if I'm wrong, but I think that's the gist of it!)
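
As an illustration of how to keep that cost out of the timed region (my own sketch, with a hypothetical helper name, not part of the benchmark above): touch the buffer once right after allocating it, then reuse it across calls.

#include <stdlib.h>
#include <string.h>

// Sketch: fault the pages in once at allocation time, outside any timed region.
float* alloc_and_prefault(size_t n_floats)
{
    float* buf = (float*) aligned_alloc(32, n_floats * sizeof(float));
    if (buf != NULL)
        memset(buf, 0, n_floats * sizeof(float));   // first touch maps physical pages
    return buf;
}

On Linux, large allocations can also be pre-faulted by the kernel (e.g. mmap with MAP_POPULATE), but simply reusing one pre-touched buffer across calls is the easiest fix here.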

A bit late to the party, but I wanted to add a pairwise method with Eigen, which is supposed to give C++ high-level algebra manipulation capabilities and use SIMD under the hood, just like numpy.

Here is the implementation:

#include <iostream>
#include <vector>
#include <chrono>
#include <algorithm>
        
#include <Eigen/Dense>

auto pairwise_eigen(const Eigen::MatrixXf &input, std::vector<Eigen::MatrixXf> &output) {
        for (int k = 0; k < input.cols(); ++k)
                output[k] = input
                          // subtract matrix with repeated k-th column
                          - input.col(k) * Eigen::RowVectorXf::Ones(input.cols());

}

int main() {
        constexpr size_t n = 512;
        
        // allocate input and output 
        Eigen::MatrixXf input = Eigen::MatrixXf::Random(n, n);
        std::vector<Eigen::MatrixXf> output(n);

        std::chrono::milliseconds best_eigen{100000};
        for (int i = 0; i < 5; ++i) {
                auto start = std::chrono::high_resolution_clock::now();
                pairwise_eigen(input, output);
                auto end = std::chrono::high_resolution_clock::now();
         
                auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end-start);      
                if (duration < best_eigen)
                        best_eigen = duration;
        }

        std::cout << "Time Eigen version: " << best_eigen.count() << " ms\n";

        return 0;
}

The full benchmark tests suggested by @celavek on my system are:

Time scaler version: 57 ms
Time SIMD version: 58 ms
Time SIMD 2 version: 40 ms
Time SIMD 3 version: 58 ms
Time OpenMP version: 58 ms

Time Eigen version: 76 ms

Numpy >> best of 5 = 118.489 ms

With Eigen there is still a noticeable improvement with respect to Numpy, but not so impressive compared to the "raw" implementations (there is certainly some overhead). An extra optimization is to allocate the output vector with copies of the input and then subtract directly from each vector entry, simply replacing the following lines:

// inside the pairwise method
        for (int k = 0; k < input.cols(); ++k)
                output[k] -= input.col(k) * Eigen::RowVectorXf::Ones(input.cols());


// at allocation time
        std::vector<Eigen::MatrixXf> output(n, input);

This pushes the best of 5 down to 60 ms.
