Making a string evaluation algorithm multi-processor

Question

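I have a Python algorithm that takes two strings as input and runs various tests on each character to return a score. It usually involves hundreds of string pairs, and since it does not write to memory, concurrency should not be a problem.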

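The thing is, from my (very limited) GPU programming experience, I remember that when coding for the GPU (OpenGL shaders) you need simple loops and a fixed length for every array, which is annoying because strings are really arrays of variable length.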

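I could consider converting the Python strings into C-like char arrays, but that seems a tedious solution, and it does not solve the problem of making simple loops.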

My question is this: is there a way to get a huge performance boost by parallelizing Python code like this on the GPU? Is it even possible?

def evaluator(baseStr, listOfStr):
    scoreList = []
    for word in listOfStr:  # PARALLELIZE THIS
        scoreList.append(evaluateTwoWords(baseStr, word))
    return scoreList

def evaluateTwoWords(baseStr, otherStr):

    # SOME WORD-WISE COMPARISON

    i = 0
    j = 0

    while i < len(baseStr) and j < len(otherStr):
        ...

    return someScore

Answer 1

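For the code provided above, yes: if each thread/worker on the GPU is assigned one string comparison, you can achieve a significant speedup on the GPU.

But GPUs have some limitations.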

1) If the string list to be loaded into device memory is very large, a lot
   of system bandwidth is spent copying it from host to device memory. This
   transfer overhead is one of the biggest drawbacks of using a GPU.

2) A GPU is most effective on algorithms that have strong SIMD (Single
   Instruction, Multiple Data) characteristics; see
   https://en.wikipedia.org/wiki/SIMD for more information. The further the
   algorithm deviates from SIMD, the larger the penalty on the speedup.
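
One way to address both the variable-length concern from the question and the transfer cost in point 1 is to pack the strings into a single flat, fixed-stride byte buffer before copying. The following is a minimal sketch in plain Python; the helper name `pack_strings` and the NUL-padding scheme are assumptions for illustration, not part of PyCuda.

```python
def pack_strings(words, encoding="ascii"):
    """Pack variable-length strings into one NUL-padded, fixed-stride buffer.

    Returns (flat_bytes, stride, lengths). Hypothetical helper for illustration.
    """
    encoded = [w.encode(encoding) for w in words]
    lengths = [len(b) for b in encoded]
    stride = max(lengths, default=0)
    # Right-pad every entry with NUL bytes so entry i starts at offset i * stride.
    flat = b"".join(b.ljust(stride, b"\0") for b in encoded)
    return flat, stride, lengths


words = ["Apple", "Microsoft", "Google", "Facebook", "Twitter"]
flat, stride, lengths = pack_strings(words)

# Number of bytes that a single host-to-device copy of this buffer would move.
transfer_bytes = len(flat)

# Entry i is recovered from its fixed offset and its stored length.
i = 2
recovered = flat[i * stride : i * stride + lengths[i]].decode("ascii")
```

With a constant stride, every GPU thread can find its own string by index alone, at the cost of copying some padding bytes.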

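Below is a sample PyCuda version of the program.

I have used PyCuda, but other OpenCL Python drivers can also do the job. Because of hardware limitations I have not tested the GPU code below, but I mostly wrote it from these examples: http://wiki.tiker.net/PyCuda/Examples.

This is what the code does.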

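1) Copy the list of strings to GPU device memory

2) Copy the base string to device memory

3) Then call the kernel function to return some information

4) Finally, reduce the returned values using a sum or whatever reduction function is needed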
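
The four steps above can be mimicked on the CPU to check a kernel's logic before porting it to the GPU. This is only a sketch in plain Python: `kernel`, `launch`, and the position-match score are hypothetical stand-ins (the question elides the real comparison), and the loop over `idx` plays the role of the GPU threads.

```python
def kernel(idx, base, flat, stride, scores):
    """Body executed once per index, like one GPU thread.

    Placeholder score (an assumption): count positions where the base string
    and word idx hold the same byte. It writes only to scores[idx].
    """
    word = flat[idx * stride : (idx + 1) * stride].rstrip(b"\0")
    scores[idx] = sum(1 for a, b in zip(base, word) if a == b)


def launch(base_str, words):
    # Steps 1 and 2: build the flat "device" buffers for the list and the base string.
    encoded = [w.encode("ascii") for w in words]
    stride = max((len(b) for b in encoded), default=0)
    flat = b"".join(b.ljust(stride, b"\0") for b in encoded)
    base = base_str.encode("ascii")

    # Step 3: run the kernel once per word; iterations do not depend on each other.
    scores = [0] * len(words)
    for idx in range(len(words)):
        kernel(idx, base, flat, stride, scores)

    # Step 4: reduce the per-word scores to a single value.
    return scores, sum(scores)


scores, total = launch("Seagate", ["Apple", "Microsoft", "Google", "Facebook", "Twitter"])
```

Because each call to `kernel` touches only its own slot in `scores`, the loop in step 3 can be replaced by parallel workers without any locking.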

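The code below is a perfect example of SIMD, where one thread's result is independent of every other thread's result. But this is the ideal case; you may have to decide whether your algorithm is suitable for the GPU.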

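import numpy
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

string_list = ['Apple', 'Microsoft', 'Google', 'Facebook', 'Twitter']
# Fixed-width byte strings: each entry is NUL-padded to the longest length,
# so the GPU sees one flat char buffer with a constant stride.
string_list_lines = numpy.array(string_list, dtype='S')
stride = string_list_lines.dtype.itemsize

# Allocate memory for the list of strings on the GPU device
string_list_linesGPU = cuda.mem_alloc(string_list_lines.nbytes)
# After allocation, copy it to GPU device memory
cuda.memcpy_htod(string_list_linesGPU, string_list_lines)

## ****** Now the GPU device has the list of strings loaded into it
## The same process is applied to the base string
baseStr = numpy.frombuffer(b"Seagate", dtype=numpy.uint8)
baseStrGPU = cuda.mem_alloc(baseStr.nbytes)
cuda.memcpy_htod(baseStrGPU, baseStr)

# A kernel cannot return a value, so each block writes its score here
scores = numpy.zeros(len(string_list), dtype=numpy.int32)
scoresGPU = cuda.mem_alloc(scores.nbytes)

# Number of blocks: one per string
blocks = len(string_list)

# Threads per block
threadsPerBlock = 1

# Write the actual kernel function
mod = SourceModule("""
__global__ void evaluateTwoWords(const char *base, int baseLen,
                                 const char *words, int stride, int *scores)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const char *word = words + idx * stride;

    // Placeholder score: count positions where the characters match.
    // You could probably follow up with some kind of algorithm here.
    int score = 0;
    for (int i = 0; i < baseLen && i < stride && word[i] != 0; ++i) {
        if (base[i] == word[i]) score++;
    }
    scores[idx] = score;
}
""")

# Run the kernel
gpusin = mod.get_function("evaluateTwoWords")
gpusin(baseStrGPU, numpy.int32(len(baseStr)),
       string_list_linesGPU, numpy.int32(stride), scoresGPU,
       grid=(blocks, 1), block=(threadsPerBlock, 1, 1))

# Copy the per-string scores back and reduce them on the host
cuda.memcpy_dtoh(scores, scoresGPU)
result = int(scores.sum())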

Hope this helps!


Solution 1
2 Accepted 2015-07-06 04:38:02
