Why is the C++ implementation not performing better than the Python implementation?

Question

I am benchmarking a nearest-neighbour search over a set of data points. My C++ implementation and my Python implementation take almost the same execution time. Shouldn't the C++ version be faster than the raw Python implementation?

C++ Execution Time: 8.506 seconds
Python Execution Time: 8.7202 seconds

C++ Code:

#include <iostream>
#include <random>
#include <map>
#include <cmath>
#include <numeric>    // std::iota
#include <algorithm>
#include <chrono>
#include <vector>

using namespace std;
using namespace std::chrono;

double edist(double* arr1, double* arr2, uint n) {
    double sum = 0.0;
    for (int i=0; i<n; i++) {
        sum += pow(arr1[i] - arr2[i], 2);
    }
    return sqrt(sum);
}

template <typename T> vector<size_t> argsort(const vector<T> &v) {
  // initialize original index locations
  vector<size_t> idx(v.size());
  iota(idx.begin(), idx.end(), 0);

  // sort indexes based on comparing values in v
  sort(idx.begin(), idx.end(),
       [&v](size_t i1, size_t i2) {return v[i1] < v[i2];});

  // skip the first sorted index (normally the point itself, at distance 0)
  return std::vector<size_t>(idx.begin() + 1, idx.end());
}

int main() {

    uint N, M;
    // cin >> N >> M;
    N = 1000;
    M = 800;
    double **arr = new double*[N];
    std::random_device rd; // obtain a random number from hardware
    std::mt19937 eng(rd()); // seed the generator
    std::uniform_real_distribution<> distr(10.0, 60.0);

    for (int i = 0; i < N; i++) {
        arr[i] = new double[M];
        for(int j=0; j < M; j++) {
            arr[i][j] = distr(eng);
        }
    }
    auto start = high_resolution_clock::now();
    map<int, vector<size_t> > dist;

    for (int i=0; i<N; i++) {
        vector<double> distances;
        for(int j=0; j<N; j++) {
            distances.push_back(edist(arr[i], arr[j], M));  // each row holds M values
        }
        dist[i] = argsort(distances);
    }
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop-start);
    int dur = duration.count();
    cout<<"Time taken by code: "<<dur<<" microseconds"<<endl;
    cout<<" In seconds: "<<dur/pow(10,6);  
    return 0;
}

Python Code:

import time
import numpy as np
def comp_inner_raw(i, x):
    res = np.zeros(x.shape[0], dtype=np.float64)
    for j in range(x.shape[0]):
        res[j] = np.sqrt(np.sum((i-x[j])**2))
    return res
def nearest_ngbr_raw(x): # x = [[1,2,3],[4,5,6],[7,8,9]]
    #print("My array: ",x)
    dist = {}
    for idx,i in enumerate(x):
        #lst = []
        lst = comp_inner_raw(i,x)
        s = np.argsort(lst)#[1:]
        sorted_array = np.array(x)[s][1:]
        dist[idx] = s[1:]
    return dist
arr = np.random.rand(1000, 800)
start = time.time()
table = nearest_ngbr_raw(arr)
print("Time taken to execute the code using raw python is {}".format(time.time()-start))

Compile Command:

 g++ -std=c++11 knn.cpp -o knn

C++ compiler (g++) version on Ubuntu 18.04.1: 7.4.0

Coded in C++11

NumPy version: 1.16.2

Edit: I tried compiling with compiler optimization, and it now takes around 1 second. Can this C++ code be optimized further from coding or any other perspective?

Answer 1

Can this C++ code be optimized further from coding or any other perspective?

I can see at least three optimisations. The first two are easy and should definitely be done, but in my testing they did not measurably affect the runtime. The third requires minimally rethinking the code.

edist calculates a costly square root, but you are only using the distance for pairwise comparison. Since the square root function is monotonically increasing, omitting it has no impact on the comparison result. Similarly, pow(x, 2) can be replaced with x * x, which is sometimes faster:
```
double edist(std::vector<double> const& arr1, std::vector<double> const& arr2, uint n) {
    double sum = 0.0;
    for (unsigned int i = 0; i < n; i++) {
        auto const diff = arr1[i] - arr2[i];
        sum += diff * diff;
    }
    return sum;
}
```
argsort performs a copy because it returns the indices excluding the first element. If you instead include the first element (change the return statement to return idx;), you avoid a potentially costly copy; the caller can simply skip the first entry when reading the result.
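A minimal sketch of that change, assuming the caller is willing to skip the first entry itself. The name argsort_full is hypothetical and only used here to keep it apart from the argsort in the question:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical variant of the question's argsort: it returns every index,
// so no second vector is constructed. Returning the local `idx` is a move
// (or is elided entirely), not an element-wise copy.
template <typename T>
std::vector<std::size_t> argsort_full(const std::vector<T>& v) {
    std::vector<std::size_t> idx(v.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...
    // sort the indices by the values they refer to
    std::sort(idx.begin(), idx.end(),
              [&v](std::size_t i1, std::size_t i2) { return v[i1] < v[i2]; });
    return idx;
}
```

With this variant, dist[i] = argsort_full(distances); stores all N indices, and the nearest neighbours of point i start at position 1 of the result, since position 0 is normally i itself at distance 0.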
Your matrix is represented as a nested array (and you're for some reason using raw pointers instead of a nested std::vector). It's generally more efficient to represent matrices as contiguous N*M arrays: std::vector<double> arr(N * M);. This is also how numpy represents matrices internally. This requires changing the code to calculate the indices: element (i, j) is then found at arr[i * M + j].
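Roughly, the contiguous layout could look like the sketch below. It assumes row-major storage, and the helper names sq_dist and row_distances are hypothetical, not taken from the question. It also returns squared distances, since only the ordering matters:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: squared Euclidean distance between two rows of
// length m. The square root is skipped because only the ordering is used.
double sq_dist(const double* a, const double* b, std::size_t m) {
    double sum = 0.0;
    for (std::size_t k = 0; k < m; ++k) {
        const double diff = a[k] - b[k];
        sum += diff * diff;
    }
    return sum;
}

// Hypothetical helper: squared distances from row i to every row of an
// n x m row-major matrix held in one contiguous vector, where element
// (r, c) lives at arr[r * m + c].
std::vector<double> row_distances(const std::vector<double>& arr,
                                  std::size_t n, std::size_t m, std::size_t i) {
    std::vector<double> distances;
    distances.reserve(n);  // one allocation instead of repeated growth
    const double* row_i = arr.data() + i * m;
    for (std::size_t j = 0; j < n; ++j) {
        distances.push_back(sq_dist(row_i, arr.data() + j * m, m));
    }
    return distances;
}
```

Filling the matrix then becomes arr[i * M + j] = distr(eng);, and the outer loop calls row_distances(arr, N, M, i) once per point before sorting the indices. Whether this is measurably faster depends on the compiler and optimization level, so it is worth timing both versions.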

Answer 1: 3, accepted, 2019-07-16 14:50:05
