Better way to compute Top N closest cities in Python/Numba using GPU

I have M~200k points with X,Y coordinates of cities packed in an Mx2 numpy array. The intent is, for each city, to compute the Top N closest cities and return their indices and distances to that city in MxN numpy matrices. Numba does a great job speeding up my serial python code on the CPU and making it multithreaded using the prange generator, such that on a 16-core SSE 4.2/AVX machine a call to compute the 30 closest cities completes in 6min 26s while saturating all the cores:

import numpy as np
import numba
from numba import prange

@numba.njit(fastmath=True)
def get_closest(n,N_CLOSEST,coords,i):
    dist=np.empty(n,np.float32)
    for j in range(n):
        dist[j]=(coords[i,0]-coords[j,0])**2+(coords[i,1]-coords[j,1])**2
    indices=np.argsort(dist)[1:N_CLOSEST+1]
    return indices,dist[indices]

@numba.njit(fastmath=True,parallel=True)
def get_top_T_closest_cities(coords,T):
    n=len(coords)

    N_CLOSEST=min(T,n-1)

    closest_distances=np.empty((n,N_CLOSEST),np.float32)
    closest_cities=np.empty((n,N_CLOSEST),np.int32)

    for i in prange(n):
        closest_cities[i,:],closest_distances[i,:]=get_closest(n,N_CLOSEST,coords,i)
    return closest_cities,closest_distances

closest_cities,closest_distances=get_top_T_closest_cities(data,30)  # data is the Mx2 float32 array of city coordinates

Note: I wanted to use mrange=np.arange(N_CLOSEST) and indices=np.argpartition(dist,mrange) later to save some cycles, but unfortunately Numba did not support np.argpartition yet.
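
For reference, here is a NumPy-only sketch of that partial-selection idea, run outside the jitted code since Numba 0.41 cannot compile np.argpartition; the helper name top_n_argpartition is mine, not something from the code above:

import numpy as np

def top_n_argpartition(dist, n_closest):
    # partial selection is O(n), versus O(n log n) for a full argsort
    part = np.argpartition(dist, n_closest)[:n_closest + 1]  # the n_closest+1 smallest, unordered
    order = part[np.argsort(dist[part])]                     # sort only those candidates
    return order[1:]                                         # drop the city itself (distance 0)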

Then I decided to put my newly bought RTX 2070 to some good use and try offloading these computations, which are parallel by nature, to the GPU, again using Numba and possibly CuPy.

After some thinking I came up with a relatively dumb rewrite, where a GPU kernel handling one city at a time was called consecutively for each of the M cities. In that kernel, each GPU thread computed its distance and saved it to a specific location in the dist array. All arrays were allocated on the device to minimize PCI data transfers:

import cupy as cp
from numba import cuda

def get_top_T_closest_cities_gpu(coords,T):
    n=len(coords)

    N_CLOSEST=min(T,n-1)

    closest_distances=cp.empty((n,N_CLOSEST),cp.float32)
    closest_cities=cp.empty((n,N_CLOSEST),cp.int32)
    device_coords=cp.asarray(coords)
    dist=cp.ndarray(n,cp.float32)
    for i in range(n):
        if (i % 1000)==0:
            print(i)
        closest_cities[i,:],closest_distances[i,:]=get_closest(n,N_CLOSEST,device_coords,i,dist)
    return cp.asnumpy(closest_cities),cp.asnumpy(closest_distances)

@cuda.jit()
def get_distances(coords,i,n,dist):
    stride = cuda.gridsize(1)
    start = cuda.grid(1)    
    for j in range(start, n, stride):
        dist[j]=(coords[i,0]-coords[j,0])**2+(coords[i,1]-coords[j,1])**2
def get_closest(n,N_CLOSEST,coords,i,dist):    
    get_distances[512,512](coords,i,n,dist)
    indices=cp.argsort(dist)[1:N_CLOSEST+1]
    return indices,dist[indices]

Now, computing the results on the GPU took almost the same 6 minutes, but GPU load was barely 1% (yes, I made sure the results returned by the CPU and GPU versions are the same). I played a bit with the block size and did not see any significant change. Interestingly, cp.argsort and get_distances were consuming approximately the same share of the processing time.
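
One way to see that split is with CUDA events in CuPy, e.g. inside get_top_T_closest_cities_gpu, reusing its local names (a sketch on my side, not how the figures above were actually measured):

start, mid, stop = (cp.cuda.Event() for _ in range(3))
start.record()
get_distances[512, 512](device_coords, 0, n, dist)   # distance kernel for city 0
mid.record()
indices = cp.argsort(dist)[1:N_CLOSEST + 1]
stop.record()
stop.synchronize()
print(cp.cuda.get_elapsed_time(start, mid), "ms in the kernel")
print(cp.cuda.get_elapsed_time(mid, stop), "ms in cp.argsort")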

I feel like it has something to do with streams: how do I properly init a lot of them? That would be the most straightforward way to reuse my code, by treating not 1 city at a time but, say, 16, or whatever my compute capability allows, ideally maybe 1000.
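
The kind of batched variant I have in mind would look roughly like this (an untested sketch; get_distances_batch is just a placeholder name, and it assumes CuPy arrays can be passed to Numba kernels as in the code above):

from numba import cuda
import cupy as cp

@cuda.jit
def get_distances_batch(coords, first, dist):
    # dist has shape (batch, n); row b holds squared distances for city first+b
    b, j = cuda.grid(2)
    if b < dist.shape[0] and j < dist.shape[1]:
        i = first + b
        dx = coords[i, 0] - coords[j, 0]
        dy = coords[i, 1] - coords[j, 1]
        dist[b, j] = dx * dx + dy * dy

# usage sketch: process, say, 1000 cities per launch (1000 x n float32 is ~0.8 GB here)
# dist = cp.empty((1000, n), cp.float32)
# threads = (8, 128)
# blocks = ((dist.shape[0] + 7) // 8, (n + 127) // 128)
# get_distances_batch[blocks, threads](device_coords, 0, dist)
# order = cp.argsort(dist, axis=1)[:, 1:N_CLOSEST + 1]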

What would those of you experienced in GPU coding in Numba/CuPy recommend to fully exploit the power of the GPU in my case?

Advice from pure C++ CUDA apologists would also be very welcome, as I haven't seen a comparison of a native CUDA solution with a Numba/CuPy CUDA solution.

Python version: ['3.6.3 |Anaconda, Inc'], platform: AMD64, system: Windows-10-10.0.14393-SP0, Numba version: 0.41.0

You seem to be interested in an answer based on CUDA C++, so I'll provide one. It should be straightforward to convert this kernel to an equivalent numba cuda jit kernel; the translation process is fairly mechanical.

An outline of the method I chose is as follows:

  • assign one city per thread (the reference city for that thread)
  • each thread traverses all the cities, computing the pairwise distance to its reference city
  • as each distance is computed, it is checked to see if it is within the current "closest cities". The approach here is to keep a running list for each thread. As each thread computes a new distance, it checks to see if that distance is less than the current maximum distance in the running list. If it is, the current maximum distance/city is replaced with the one just computed. (A short host-side sketch of this bookkeeping appears just below this outline.)
  • the list of "closest cities" along with their distances is maintained in shared memory
  • at the completion of computing all distances, each thread then sorts and writes out the values in its shared buffer to global memory
  • there are several "optimizations" to this basic plan. One is that each warp reads 32 consecutive values at a time, and then uses a shuffle operation to pass the values around among threads, so as to reduce global memory traffic.

A single kernel call performs the entire operation for all cities.
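
As a plain-Python illustration of that per-thread bookkeeping (host-side only, not part of the kernel; the function name running_top_n is mine), the running list works roughly like this:

import numpy as np

def running_top_n(sq_dists, city_ids, n_closest):
    best_d = np.full(n_closest, np.float32(999999.0))  # plays the role of FLOAT_MAX
    best_id = np.full(n_closest, -1, dtype=np.int32)
    for d, c in zip(sq_dists, city_ids):
        m = int(np.argmax(best_d))       # find_max() in the kernel
        if d < best_d[m]:                # closer than the current worst candidate?
            best_d[m] = d                # replace the running maximum
            best_id[m] = c
    order = np.argsort(best_d)           # final sort, as in sort_and_store_sqrt()
    return best_id[order], np.sqrt(best_d[order])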

The performance of this kernel will vary quite a bit depending on the GPU you are running it on. For example, on a Tesla P100 it runs in about 0.5s. On a Tesla K40 it takes ~6s. On a Quadro K2000 it takes ~40s.

Here is a full test case on the Tesla P100:

$ cat t364.cu
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <algorithm>
#include <vector>
#include <thrust/sort.h>

const int ncities = 200064;  // should be divisible by nTPB
const int nclosest = 32;     // maximum 48
const int nTPB = 128;        // should be multiple of 32
const float FLOAT_MAX = 999999.0f;


__device__ void init_shared(float *d, int n){
  int idx = threadIdx.x;
  for (int i = 0; i < n; i++){
    d[idx] = FLOAT_MAX;
    idx += nTPB;}
}

__device__ int find_max(float *d, int n){
  int idx = threadIdx.x;
  int max = 0;
  float maxd = d[idx];
  for (int i = 1; i < n; i++){
    idx += nTPB;
    float next = d[idx];
    if (maxd < next){
      max = i;
      maxd = next;}}
  return max;
}

__device__ float compute_sqdist(float2 c1, float2 c2){
  return (c1.x - c2.x)*(c1.x - c2.x) + (c1.y - c2.y)*(c1.y - c2.y);
}

__device__ void sort_and_store_sqrt(int idx, float *closest_dist, int *closest_id, float *dist,  int *city, int n, int n_cities)
{
  for (int i = n-1; i >= 0; i--){
    int max = find_max(dist, n);
    closest_dist[idx + i*n_cities] = sqrtf(dist[max*nTPB+threadIdx.x]);
    closest_id[idx + i*n_cities] = city[max*nTPB+threadIdx.x];
    dist[max*nTPB+threadIdx.x] = 0.0f;}
};


__device__ void shuffle(int &city_id, float2 &next_city_xy, int my_warp_lane){

  int src_lane = (my_warp_lane+1)&31;
  city_id = __shfl_sync(0xFFFFFFFF, city_id, src_lane);
  next_city_xy.x = __shfl_sync(0xFFFFFFFF, next_city_xy.x, src_lane);
  next_city_xy.y = __shfl_sync(0xFFFFFFFF, next_city_xy.y, src_lane);
}

__global__ void k(const float2 * __restrict__ cities, float * __restrict__ closest_dist, int * __restrict__ closest_id, const int n_cities){

  __shared__  float dist[nclosest*nTPB];
  __shared__  int   city[nclosest*nTPB];

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int my_warp_lane = (threadIdx.x & 31);
  init_shared(dist, nclosest);
  float2 my_city_xy = cities[idx];
  float last_max = FLOAT_MAX;
  for (int i = 0; i < n_cities; i+=32){
    int city_id = i+my_warp_lane;
    float2 next_city_xy = cities[city_id];
    for (int j = 0; j < 32; j++){
      if (j > 0) shuffle(city_id, next_city_xy, my_warp_lane);
      if (idx != city_id){
        float my_dist = compute_sqdist(my_city_xy, next_city_xy);
        if (my_dist < last_max){
          int max = find_max(dist, nclosest);
          last_max = dist[max*nTPB+threadIdx.x];
          if (my_dist < last_max) {
            dist[max*nTPB+threadIdx.x] = my_dist;
            city[max*nTPB+threadIdx.x] = city_id;}}}}}
  sort_and_store_sqrt(idx, closest_dist, closest_id, dist, city, nclosest, n_cities);
}

void on_cpu(float2 *cities, float *dists, int *ids){
  thrust::host_vector<int> my_ids(ncities);
  thrust::host_vector<float> my_dists(ncities);
  for (int i = 0; i < ncities; i++){
    for (int j = 0; j < ncities; j++){
      my_ids[j] = j;
      if (i != j) my_dists[j] = sqrtf((cities[i].x - cities[j].x)*(cities[i].x - cities[j].x) + (cities[i].y - cities[j].y)*(cities[i].y - cities[j].y));
      else my_dists[j] = FLOAT_MAX;}
    thrust::sort_by_key(my_dists.begin(), my_dists.end(), my_ids.begin());
    for (int j = 0; j < nclosest; j++){
      dists[i+j*ncities] = my_dists[j];
      ids[i+j*ncities] = my_ids[j];}
    }
}


int main(){

  float2 *h_cities, *d_cities;
  float *h_closest_dist, *d_closest_dist, *h_closest_dist_cpu;
  int *h_closest_id, *d_closest_id, *h_closest_id_cpu;

  cudaMalloc(&d_cities, ncities*sizeof(float2));
  cudaMalloc(&d_closest_dist, nclosest*ncities*sizeof(float));
  cudaMalloc(&d_closest_id, nclosest*ncities*sizeof(int));

  h_cities = new float2[ncities];
  h_closest_dist = new float[ncities*nclosest];
  h_closest_id = new int[ncities*nclosest];
  h_closest_dist_cpu = new float[ncities*nclosest];
  h_closest_id_cpu = new int[ncities*nclosest];

  for (int i = 0; i < ncities; i++){
    h_cities[i].x = (rand()/(float)RAND_MAX)*10.0f;
    h_cities[i].y = (rand()/(float)RAND_MAX)*10.0f;}

  cudaMemcpy(d_cities, h_cities, ncities*sizeof(float2), cudaMemcpyHostToDevice);
  k<<<ncities/nTPB, nTPB>>>(d_cities, d_closest_dist, d_closest_id, ncities);
  cudaMemcpy(h_closest_dist, d_closest_dist, ncities*nclosest*sizeof(float),cudaMemcpyDeviceToHost);
  cudaMemcpy(h_closest_id, d_closest_id, ncities*nclosest*sizeof(int),cudaMemcpyDeviceToHost);
  for (int i = 0; i < nclosest; i++){
    for (int j = 0; j < 5; j++) std::cout << h_closest_id[j+i*ncities] << ","  << h_closest_dist[j+i*ncities] << "   ";
    std::cout << std::endl;}
  if (ncities < 5000){
    on_cpu(h_cities, h_closest_dist_cpu, h_closest_id_cpu);
    for (int i = 0; i < ncities*nclosest; i++)
      if (h_closest_id_cpu[i] != h_closest_id[i]) {std::cout << "mismatch at: " << i << " was: " << h_closest_id[i] << " should be: " << h_closest_id_cpu[i] << std::endl; return -1;}
    }
  return 0;
}

$ nvcc -o t364 t364.cu
$ nvprof ./t364
==31871== NVPROF is profiling process 31871, command: ./t364
33581,0.00466936   116487,0.0163371   121419,0.0119542   138585,0.00741395   187892,0.0205568
56138,0.0106946   105637,0.0195111   175565,0.0132018   121719,0.0090809   198333,0.0269794
36458,0.014851   6942,0.0244329   67920,0.013367   140919,0.014739   91533,0.0348142
48152,0.0221216   146867,0.0250192   14656,0.020469   163149,0.0247639   162442,0.0354359
3681,0.0226946   127671,0.0265515   32841,0.0219539   109520,0.0359874   21346,0.0424094
20903,0.0313963   3075,0.0279635   79787,0.0220388   106161,0.0365807   24186,0.0453916
191226,0.0316818   4168,0.0297535   126726,0.0246147   154598,0.0374218   62106,0.0459001
185573,0.0349628   76270,0.030849   122878,0.0249695   124897,0.0447656   38124,0.0463704
71252,0.037517   18879,0.0350544   112410,0.0299296   102006,0.0462593   12361,0.0464608
172641,0.0399721   134288,0.035077   39252,0.031506   164570,0.0470057   81105,0.0502873
18960,0.0433545   53266,0.0360357   195467,0.0334281   36715,0.0470069   153375,0.0508115
163568,0.0442928   95544,0.0361058   151839,0.0398384   114041,0.0490263   8524,0.0511531
182326,0.047519   59919,0.0362906   90810,0.0408069   52013,0.0494515   16793,0.0525569
169860,0.048417   77412,0.036694   12065,0.0414496   137863,0.0494703   197500,0.0537481
40158,0.0492621   34592,0.0377113   54812,0.041594   58792,0.049532   70501,0.0541114
121444,0.0501154   102800,0.0414865   96238,0.0433548   34323,0.0531493   161527,0.0551868
118564,0.0509228   82832,0.0449587   167965,0.0439488   104475,0.0534779   66968,0.0572788
60136,0.0528873   88318,0.0455667   118562,0.0462537   129853,0.0535594   7544,0.0588875
95975,0.0587857   65792,0.0479467   124929,0.046828   116360,0.0537344   37341,0.0594454
62542,0.0592229   57399,0.0509665   186583,0.0470843   47571,0.0541045   81275,0.0596965
11499,0.0607943   28314,0.0512354   23730,0.0518801   176089,0.0543222   562,0.06527
131594,0.0611795   23120,0.0515408   25933,0.0547776   117474,0.0557752   194499,0.0657885
23959,0.0623019   137283,0.0533193   173000,0.0577509   157537,0.0566689   193091,0.0666895
5660,0.0629772   15498,0.0555821   161025,0.0596721   123148,0.0589929   147928,0.0672529
51033,0.063036   15456,0.0565314   145859,0.0601408   3012,0.0601779   107646,0.0687241
26790,0.0643055   99659,0.0576798   33153,0.0603556   48388,0.0617377   47566,0.0689055
178826,0.0655352   143209,0.058413   101960,0.0604994   22146,0.0620504   91118,0.0705487
32108,0.0672866   172089,0.058676   17885,0.0638383   137318,0.0624543   138223,0.0716578
38125,0.0678566   42847,0.0609811   10879,0.0681518   154360,0.0633921   96195,0.0723226
96583,0.0683073   169703,0.0621889   100839,0.0721133   28595,0.0661302   20422,0.0731882
98329,0.0683718   50432,0.0636618   84204,0.0733909   181919,0.066552   75375,0.0736715
75814,0.0694582   161714,0.0674298   89049,0.0749184   151275,0.0679411   37849,0.0739173
==31871== Profiling application: ./t364
==31871== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   95.66%  459.88ms         1  459.88ms  459.88ms  459.88ms  k(float2 const *, float*, int*, int)
                    4.30%  20.681ms         2  10.340ms  10.255ms  10.425ms  [CUDA memcpy DtoH]
                    0.04%  195.55us         1  195.55us  195.55us  195.55us  [CUDA memcpy HtoD]
      API calls:   58.18%  482.42ms         3  160.81ms  472.46us  470.80ms  cudaMemcpy
                   41.05%  340.38ms         3  113.46ms  322.83us  339.73ms  cudaMalloc
                    0.55%  4.5490ms       384  11.846us     339ns  503.79us  cuDeviceGetAttribute
                    0.16%  1.3131ms         4  328.27us  208.85us  513.66us  cuDeviceTotalMem
                    0.05%  407.13us         4  101.78us  93.796us  118.27us  cuDeviceGetName
                    0.01%  61.719us         1  61.719us  61.719us  61.719us  cudaLaunchKernel
                    0.00%  23.965us         4  5.9910us  3.9830us  8.9510us  cuDeviceGetPCIBusId
                    0.00%  6.8440us         8     855ns     390ns  1.8150us  cuDeviceGet
                    0.00%  3.7490us         3  1.2490us     339ns  2.1300us  cuDeviceGetCount
                    0.00%  2.0260us         4     506ns     397ns     735ns  cuDeviceGetUuid
$

CUDA 10.0, Tesla P100, CentOS 7

The code includes the possibility for verification, but it verifies the results against a CPU-based naive code, which takes quite a bit longer. So I've restricted verification to smaller test cases, e.g. 4096 cities.

Here's a "translated" approximation of the above code using numba cuda. It seems to run about 2x slower than the CUDA C++ version, which shouldn't really be the case. (One of the contributing factors seems to be the use of the sqrt function - it has a significant performance impact in the numba code but no perf impact in the CUDA C++ code. It may be using a double-precision realization under the hood in numba CUDA.) Anyway, it gives a reference for how the kernel could be realized in numba:

import numpy as np
import numba as nb
import math
from numba import cuda,float32,int32

# number of cities, should be divisible by threadblock size
N = 200064
# number of "closest" cities
TN = 32
#number of threads per block
threadsperblock = 128
#shared memory size of each array in elements
smSize=TN*threadsperblock
@cuda.jit('int32(float32[:])',device=True)
def find_max(dist):
    idx = cuda.threadIdx.x
    mymax = 0
    maxd = dist[idx]
    for i in range(TN-1):
        idx += threadsperblock
        mynext = dist[idx]
        if maxd < mynext:
            mymax = i+1
            maxd = mynext
    return mymax

@cuda.jit('void(float32[:], float32[:], float32[:], int32[:], int32)')
def k(citiesx, citiesy, td, ti, nc):
    dist = cuda.shared.array(smSize, float32)
    city = cuda.shared.array(smSize, int32)
    idx = cuda.grid(1)
    tid = cuda.threadIdx.x
    wl = tid & 31
    my_city_x = citiesx[idx]
    my_city_y = citiesy[idx]
    last_max = 99999.0
    for i in range(TN):
        dist[tid+i*threadsperblock] = 99999.0
    for i in range(0, nc, 32):
        city_id = i+wl
        next_city_x = citiesx[city_id]
        next_city_y = citiesy[city_id]
        for j in range(32):
            if j > 0:
                src_lane = (wl+1)&31
                city_id = cuda.shfl_sync(0xFFFFFFFF, city_id, src_lane)
                next_city_x = cuda.shfl_sync(0xFFFFFFFF, next_city_x, src_lane)
                next_city_y = cuda.shfl_sync(0xFFFFFFFF, next_city_y, src_lane)
            if idx != city_id:
                my_dist = (next_city_x - my_city_x)*(next_city_x - my_city_x) + (next_city_y - my_city_y)*(next_city_y - my_city_y)
                if my_dist < last_max:
                    mymax = find_max(dist)
                    last_max = dist[mymax*threadsperblock+tid]
                    if my_dist < last_max:
                        dist[mymax*threadsperblock+tid] = my_dist
                        city[mymax*threadsperblock+tid] = city_id
    for i in range(TN):
        mymax = find_max(dist)
        td[idx+i*nc] = math.sqrt(dist[mymax*threadsperblock+tid])
        ti[idx+i*nc] = city[mymax*threadsperblock+tid]
        dist[mymax*threadsperblock+tid] = 0

# perform test
cx  = np.random.rand(N).astype(np.float32)
cy  = np.random.rand(N).astype(np.float32)

td = np.zeros(N*TN, dtype=np.float32)
ti = np.zeros(N*TN, dtype=np.int32)


d_cx = cuda.to_device(cx)
d_cy = cuda.to_device(cy)
d_td = cuda.device_array_like(td)
d_ti = cuda.device_array_like(ti)
k[N // threadsperblock, threadsperblock](d_cx, d_cy, d_td, d_ti, N)
d_td.copy_to_host(td)
d_ti.copy_to_host(ti)
for i in range(TN):
    for j in range(1):
        print(ti[i*N+j])
        print(td[i*N+j])

I haven't included validation in this numba cuda variant, so it may contain defects.
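
A quick sanity check I would suggest (my addition, not part of the original answer) is to brute-force a handful of random cities in NumPy and compare distances; indices can legitimately differ on near-exact ties, so comparing distances is the safer test:

gpu_d = td.reshape(TN, N)[::-1, :]        # the kernel stores farthest-first; flip to nearest-first
for i in np.random.randint(0, N, size=10):
    d = np.sqrt((cx - cx[i])**2 + (cy - cy[i])**2)
    d[i] = np.inf                         # exclude the city itself
    ref = np.sort(d)[:TN]                 # the TN smallest reference distances
    assert np.allclose(ref, gpu_d[:, i], atol=1e-4), "mismatch for city %d" % i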

EDIT: By switching the above code examples from 128 threads per block to 64 threads per block, the codes will run noticeably faster. This is due to the increase in theoretical and achieved occupancy that this allows, given that shared memory is the occupancy limiter here.
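
The shared-memory arithmetic behind that is simple (rough figures; the per-SM limit depends on the GPU):

TN = 32
for tpb in (128, 64):
    smem = TN * tpb * (4 + 4)   # 4-byte float32 dist + 4-byte int32 city per slot
    print(tpb, "threads/block ->", smem // 1024, "KiB of shared memory per block")
# 128 threads/block -> 32 KiB per block, 64 -> 16 KiB; the smaller footprint lets more
# blocks be resident per SM on shared-memory-limited GPUs.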

As pointed out in the comments by max9111, a better algorithm exists. Here's a comparison (on a very slow GPU) between the above python code and the tree-based algorithm (borrowing from the answer here):

$ cat t27.py
import numpy as np
import numba as nb
import math
from numba import cuda,float32,int32
from scipy import spatial
import time

#Create some coordinates and indices
#It is assumed that the coordinates are unique (only one entry per hydrant)
ncities = 200064
Coords=np.random.rand(ncities*2).reshape(ncities,2)
Coords*=100
Indices=np.arange(ncities) #Indices
nnear = 33
def get_indices_of_nearest_neighbours(Coords,Indices):
  tree=spatial.cKDTree(Coords)
  #k=4 because the first entry is the nearest neighbour
  # of a point with itself
  res=tree.query(Coords, k=nnear)[1][:,1:]
  return Indices[res]

start = time.time()
a = get_indices_of_nearest_neighbours(Coords, Indices)
end = time.time()
print (a.shape[0],a.shape[1])
print
print(end -start)

# number of cities, should be divisible by threadblock size
N = ncities
# number of "closest" cities
TN = nnear - 1
#number of threads per block
threadsperblock = 64
#shared memory size of each array in elements
smSize=TN*threadsperblock
@cuda.jit('int32(float32[:])',device=True)
def find_max(dist):
    idx = cuda.threadIdx.x
    mymax = 0
    maxd = dist[idx]
    for i in range(TN-1):
        idx += threadsperblock
        mynext = dist[idx]
        if maxd < mynext:
            mymax = i+1
            maxd = mynext
    return mymax

@cuda.jit('void(float32[:], float32[:], float32[:], int32[:], int32)')
def k(citiesx, citiesy, td, ti, nc):
    dist = cuda.shared.array(smSize, float32)
    city = cuda.shared.array(smSize, int32)
    idx = cuda.grid(1)
    tid = cuda.threadIdx.x
    wl = tid & 31
    my_city_x = citiesx[idx];
    my_city_y = citiesy[idx];
    last_max = 99999.0
    for i in range(TN):
        dist[tid+i*threadsperblock] = 99999.0
    for i in xrange(0,nc,32):
        city_id = i+wl
        next_city_x = citiesx[city_id]
        next_city_y = citiesy[city_id]
        for j in range(32):
            if j > 0:
                src_lane = (wl+1)&31
                city_id = cuda.shfl_sync(0xFFFFFFFF, city_id, src_lane)
                next_city_x = cuda.shfl_sync(0xFFFFFFFF, next_city_x, src_lane)
                next_city_y = cuda.shfl_sync(0xFFFFFFFF, next_city_y, src_lane)
            if idx != city_id:
                my_dist = (next_city_x - my_city_x)*(next_city_x - my_city_x) + (next_city_y - my_city_y)*(next_city_y - my_city_y)
                if my_dist < last_max:
                    mymax = find_max(dist)
                    last_max = dist[mymax*threadsperblock+tid]
                    if my_dist < last_max:
                        dist[mymax*threadsperblock+tid] = my_dist
                        city[mymax*threadsperblock+tid] = city_id
    for i in range(TN):
        mymax = find_max(dist)
        td[idx+i*nc] = math.sqrt(dist[mymax*threadsperblock+tid])
        ti[idx+i*nc] = city[mymax*threadsperblock+tid]
        dist[mymax*threadsperblock+tid] = 0

#peform test
cx  = np.zeros(N).astype(np.float32)
cy  = np.zeros(N).astype(np.float32)
for i in range(N):
  cx[i] = Coords[i,0]
  cy[i] = Coords[i,1]
td = np.zeros(N*TN, dtype=np.float32)
ti = np.zeros(N*TN, dtype=np.int32)

start = time.time()
d_cx = cuda.to_device(cx)
d_cy = cuda.to_device(cy)
d_td = cuda.device_array_like(td)
d_ti = cuda.device_array_like(ti)
k[N/threadsperblock, threadsperblock](d_cx, d_cy, d_td, d_ti, N)
d_td.copy_to_host(td)
d_ti.copy_to_host(ti)
end = time.time()
print(a)
print
print(end - start)
print(np.flip(np.transpose(ti.reshape(nnear-1,ncities)), 1))
$ python t27.py
(200064, 32)

1.32144594193
[[177220  25281 159413 ..., 133366  45179  27508]
 [ 56956 163534  90723 ..., 135081  33025 167104]
 [ 88708  42958 162851 ..., 115964  77180  31684]
 ...,
 [186540  52500 124911 ..., 157102  83407  38152]
 [151053 144837  34487 ..., 171148  37887  12591]
 [ 13820 199968  88667 ...,   7241 172376  51969]]

44.1796119213
[[177220  25281 159413 ..., 133366  45179  27508]
 [ 56956 163534  90723 ..., 135081  33025 167104]
 [ 88708  42958 162851 ..., 115964  77180  31684]
 ...,
 [186540  52500 124911 ..., 157102  83407  38152]
 [151053 144837  34487 ..., 171148  37887  12591]
 [ 13820 199968  88667 ...,   7241 172376  51969]]
$

On this very slow Quadro K2000 GPU, the CPU/scipy-based kd-tree algorithm is much faster than this GPU implementation. However, at ~1.3s, the scipy implementation is still a bit slower than the brute-force CUDA C++ code running on a Tesla P100, and faster GPUs exist today. A possible explanation for this disparity is given here. As pointed out there, the brute-force approach is easily parallelizable, whereas the kd-tree approach may be non-trivial to parallelize. In any event, a faster solution yet might be a GPU-based kd-tree implementation.
