
Matlab and GPU/CUDA programming

I need to run several independent analyses on the same data set. Specifically, I need to run batches of 100 GLM (generalized linear model) analyses and was thinking of taking advantage of my graphics card (GTX 580).

As I have access to Matlab and the Parallel Computing Toolbox (and I'm not good with C++), I decided to give it a try.

I understand that a single GLM is not ideal for parallel computation, but as I need to run 100-200 of them in parallel, I thought that using parfor could be a solution.

My problem is that it is not clear to me which approach I should follow. I wrote a gpuArray version of the Matlab function glmfit, but using parfor doesn't have any advantage over a standard for loop.
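For illustration, a minimal sketch of this kind of loop; the variable names, sizes, and the Poisson family are placeholders, not the actual code (which uses a gpuArray port of glmfit):

    % Minimal sketch: many independent GLM fits under parfor (CPU workers).
    % Assumes X (n-by-p design matrix) and Y (n-by-nModels responses) exist;
    % names and the Poisson family are illustrative.
    if matlabpool('size') == 0
        matlabpool open                % defaults to the number of CPU cores
    end
    nModels = size(Y, 2);
    coeffs  = zeros(size(X, 2) + 1, nModels);  % glmfit prepends an intercept
    parfor k = 1:nModels
        % Each fit is independent, so iterations run on separate workers.
        coeffs(:, k) = glmfit(X, Y(:, k), 'poisson');
    end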

Does this have anything to do with the matlabpool setting? It is not even clear to me how to set it to "see" the GPU card. By default, it is set to the number of cores in the CPU (4 in my case), if I'm not wrong. Am I completely wrong in my approach?

Any suggestion would be highly appreciated.

Edit

Thanks. I'm aware of GPUmat and Jacket, and I could start writing in C without too much effort, but I'm testing the GPU computing possibilities for a department where everybody uses Matlab or R. The final goal would be a cluster based on C2050 cards and Matlab Distributed Computing Server (or at least this was the first project). Reading the ads from MathWorks, I was under the impression that parallel computing was possible even without C skills. It is impossible to ask the researchers in my department to learn C, so I'm guessing that GPUmat and Jacket are the better solutions, even though their limitations are quite big and support for several commonly used routines like glm is non-existent.

How can they be interfaced with a cluster? Do they work with some job distribution system?

I would recommend you try either GPUmat (free) or AccelerEyes Jacket (paid, but has a free trial) rather than the Parallel Computing Toolbox. The toolbox doesn't have as much functionality.

To get the most performance, you may want to learn some C (no need for C++) and code in raw CUDA yourself. Many of these high-level tools may not be smart enough about how they manage memory transfers (you could lose all your computational benefit from needlessly shuffling data across the PCI-E bus).
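To make the transfer point concrete in Matlab terms, a minimal sketch (the matrix and the loop body are placeholders): copy the data to the device once, keep the work there, and copy back once at the end.

    % Minimal sketch: avoid shuffling data across the PCI-E bus.
    X  = rand(4096);                 % host data (placeholder)
    Xg = gpuArray(X);                % one host-to-device transfer
    for k = 1:100
        Xg = 4 .* Xg .* (1 - Xg);    % element-wise work stays on the device
    end
    result = gather(Xg);             % one device-to-host transfer at the end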

Parfor will help you utilize multiple GPUs, but not a single GPU. The thing is that a single GPU can do only one thing at a time, so parfor on a single GPU and a plain for loop on a single GPU will achieve exactly the same effect (as you are seeing).
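For the multi-GPU case, the usual pattern is one worker per device, selected once with spmd; a sketch, assuming the machine actually has more than one GPU (the data and the computation are placeholders):

    % Minimal sketch: one worker per GPU, so parfor iterations overlap.
    nGPU = gpuDeviceCount;
    matlabpool('open', nGPU);        % exactly one worker per device
    spmd
        gpuDevice(labindex);         % worker i claims device i
    end
    results = zeros(1, 100);
    parfor k = 1:100
        d = gpuArray(rand(1e4, 1));  % lands on this worker's GPU
        results(k) = gather(sum(d .^ 2));
    end
    matlabpool close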

Jacket tends to be more efficient, as it can combine multiple operations and run them more efficiently, and it has more features; but most departments already have the Parallel Computing Toolbox and not Jacket, so that can be an issue. You can try the demo to check.

I have no experience with GPUmat.

The Parallel Computing Toolbox is getting better; what you need is some large matrix operations. GPUs are good at doing the same thing many times over, so you need to either combine your code somehow into one operation or make each operation big enough. We are talking about needing at least ~10,000 things in parallel, although it's not a set of 1e4 matrices but rather one large matrix with at least 1e4 elements.
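A sketch of what "big enough" means here: one element-wise pass over a matrix holding the linear predictors of many models side by side, instead of a loop over a hundred small vectors (the sizes and the inverse-logit step are illustrative):

    % Minimal sketch: batch many small problems into one large gpuArray op.
    eta = gpuArray(randn(1e4, 100)); % 100 models' linear predictors at once
    mu  = 1 ./ (1 + exp(-eta));      % one pass over all 1e6 elements
    wait(gpuDevice);                 % block until the GPU finishes (timing)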

I do find that with the Parallel Computing Toolbox you still need quite a bit of inline CUDA code to be effective (it's still pretty limited). It does let you inline kernels and transform Matlab code into kernels, though.
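The main mechanism for that is arrayfun on gpuArray inputs, which compiles a scalar Matlab function into a single fused kernel; a minimal sketch with illustrative data:

    % Minimal sketch: arrayfun fuses the scalar function into one GPU kernel
    % instead of launching one kernel per element-wise operation.
    mu = gpuArray(rand(1e4, 100));         % placeholder fitted means
    w  = arrayfun(@(m) m .* (1 - m), mu);  % binomial variance mu*(1-mu)
    w  = gather(w);                        % bring results back to the host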
