
Implementing large linear regression models using CUDA

For the analysis of 10^6 genetic factors and their GeneXGene interactions (~5x10^11 pairs), I have numerous independent linear regression problems which are probably suitable for analysis on GPUs.

The objective is to exhaustively search for GeneXGene interaction effects in modulating an outcome variable (a brain phenotype) using linear regression with the interaction term included.

As far as I know, Householder QR factorization could be the method of choice for fitting the regression models. However, given that each regression matrix in this particular work is roughly 10'000x10 in size, even a single regression matrix does not seem to fit in GPU on-chip memory (shared memory, registers, etc.).

Should I accept this as a problem which is inherently bandwidth-limited and keep the matrices in GPU global memory during regression analysis, or are other strategies possible?

EDIT: Here are more details about the problem:

There will be approximately 10'000 subjects, each with ~1M genetic parameters (genetic matrix: 10'000x10^6). In each iteration, the algorithm should select two columns of this genetic matrix (10'000x2) plus about 6 other variables unrelated to the genetic data (age, gender, etc.), so the final regression model will deal with a matrix of size 10'000x[2 (genetic factors) + 6 (covariates) + 2 (intercept & interaction term)] and an outcome variable vector (10'000x1). This same process will be repeated ~5e11 times, each time with a different pair of genetic factors. Models passing a predefined statistical threshold should be saved as output.
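For concreteness, here is a minimal sketch (assuming column-major, double-precision storage; the kernel and variable names are only illustrative) of how one such 10'000x10 design matrix could be filled on the GPU from a pair of gene columns plus the fixed covariates:

```cpp
// Minimal sketch: build one n x 10 design matrix X in column-major order.
// Columns: [intercept | gene_i | gene_j | gene_i*gene_j (interaction) | 6 covariates]
// d_genes points at the genetic matrix (n x 1e6, column-major), d_cov at the n x 6 covariates.
__global__ void build_design_matrix(const double* d_genes, const double* d_cov,
                                    int n, int gene_i, int gene_j, double* d_X)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    double gi = d_genes[(size_t)gene_i * n + row];
    double gj = d_genes[(size_t)gene_j * n + row];

    d_X[0 * (size_t)n + row] = 1.0;      // intercept
    d_X[1 * (size_t)n + row] = gi;       // first genetic factor
    d_X[2 * (size_t)n + row] = gj;       // second genetic factor
    d_X[3 * (size_t)n + row] = gi * gj;  // GeneXGene interaction term
    for (int c = 0; c < 6; ++c)          // age, gender and the other covariates
        d_X[(size_t)(4 + c) * n + row] = d_cov[(size_t)c * n + row];
}
```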

The specific problem is that although there are ~5e11 separate regression models, even a single one does not seem to fit in on-chip memory.

I also suspect that sticking with CUDA libraries may not be the solution here, as that would mandate that most of the data manipulation take place on the CPU side, with only each individual QR decomposition being sent to the GPU?

Your whole data matrix (1e4 x 1e6) may be too large to fit in global memory, while each of your least-squares solves (1e4 x 10) may be too small to fully utilize the GPU.


For each least squares problem, you could use cuSolver for QR factorization and triangular solving.

http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-geqrf
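A minimal sketch of that path for a single n x p model, loosely following the least-squares pattern from the cuSolver documentation (geqrf, ormqr, then a cuBLAS triangular solve); error handling is omitted and the function name and calling convention are placeholders:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

// Sketch: solve min ||X*beta - y|| for one n x p model already in device memory
// (d_X column-major with leading dimension n, d_y of length n). Error checks omitted.
void solve_one_model(cusolverDnHandle_t solver, cublasHandle_t blas,
                     double* d_X, double* d_y, int n, int p)
{
    double* d_tau;  cudaMalloc(&d_tau,  sizeof(double) * p);
    int*    d_info; cudaMalloc(&d_info, sizeof(int));

    // Workspace large enough for both geqrf and ormqr
    int lwork_qr = 0, lwork_mq = 0;
    cusolverDnDgeqrf_bufferSize(solver, n, p, d_X, n, &lwork_qr);
    cusolverDnDormqr_bufferSize(solver, CUBLAS_SIDE_LEFT, CUBLAS_OP_T,
                                n, 1, p, d_X, n, d_tau, d_y, n, &lwork_mq);
    int lwork = lwork_qr > lwork_mq ? lwork_qr : lwork_mq;
    double* d_work; cudaMalloc(&d_work, sizeof(double) * lwork);

    // Householder QR in place: X = Q * R
    cusolverDnDgeqrf(solver, n, p, d_X, n, d_tau, d_work, lwork, d_info);

    // y := Q^T * y
    cusolverDnDormqr(solver, CUBLAS_SIDE_LEFT, CUBLAS_OP_T,
                     n, 1, p, d_X, n, d_tau, d_y, n, d_work, lwork, d_info);

    // Solve the p x p upper-triangular system R * beta = (Q^T y)(1:p) in place;
    // the first p entries of d_y then hold the regression coefficients.
    const double one = 1.0;
    cublasDtrsm(blas, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                p, 1, &one, d_X, n, d_y, n);

    cudaFree(d_work); cudaFree(d_info); cudaFree(d_tau);
}
```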

If the problem size is too small to fully utilize the GPU, you could use concurrent kernel execution to solve multiple equations at the same time.

https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
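One way this concurrency could look (illustrative only; the stream count and the helper calls inside the loop are assumptions): create a few streams, bind a cuSolver and cuBLAS handle to each with cusolverDnSetStream/cublasSetStream, and round-robin the gene pairs over them so several small solves are in flight at once:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

// Illustrative only: spread many small, independent solves over a few streams so the
// per-model kernels and cuSolver calls can execute concurrently on the GPU.
void solve_batch_concurrently(long pairs_in_batch)
{
    const int NSTREAMS = 8;
    cudaStream_t       streams[NSTREAMS];
    cusolverDnHandle_t solvers[NSTREAMS];
    cublasHandle_t     blases [NSTREAMS];

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamCreate(&streams[s]);
        cusolverDnCreate(&solvers[s]);
        cublasCreate(&blases[s]);
        cusolverDnSetStream(solvers[s], streams[s]);  // bind cuSolver work to this stream
        cublasSetStream(blases[s], streams[s]);       // bind cuBLAS work to this stream
    }

    for (long pair = 0; pair < pairs_in_batch; ++pair) {
        int s = pair % NSTREAMS;
        // Enqueue asynchronous work for this gene pair on stream s, e.g.:
        //   build_design_matrix<<<grid, block, 0, streams[s]>>>(...);
        //   solve_one_model(solvers[s], blases[s], d_X[s], d_y[s], n, p);
        (void)s;
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s) {
        cublasDestroy(blases[s]);
        cusolverDnDestroy(solvers[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```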


For the whole data matrix, if it cannot fit into global memory, you could work on only part of it at a time. For example, you could divide the matrix into ten (1e4 x 1e5) blocks; each time, load two of the blocks over PCIe, form all possible two-column combinations drawn from the two blocks, solve the corresponding equations, and then load another two blocks. Maximizing the block size will help you minimize the PCIe data transfer. With a proper design, I'm sure the time for PCIe data transfer will be much smaller than the time spent solving ~5e11 equations. Furthermore, you could overlap the data transfers with the solver kernel executions.

https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
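A rough sketch of that blocking scheme with overlapped transfers, assuming the genetic matrix has been pre-split into column blocks held in pinned host memory (all names, the double-buffering layout, and the event handshake are illustrative, not a fixed recipe):

```cpp
#include <cuda_runtime.h>

// Rough sketch of overlapping PCIe transfers with solver work. The genetic matrix is
// assumed to be pre-split into nblocks column blocks (n x block_cols doubles each) held
// in pinned host memory, so cudaMemcpyAsync can run asynchronously with computation.
void process_all_block_pairs(double* const h_blocks[], int nblocks,
                             int n, int block_cols)
{
    const size_t block_bytes = sizeof(double) * (size_t)n * block_cols;

    double *d_block_i, *d_partner[2];          // one resident block + a double buffer
    cudaMalloc(&d_block_i, block_bytes);
    cudaMalloc(&d_partner[0], block_bytes);
    cudaMalloc(&d_partner[1], block_bytes);

    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    cudaEvent_t ready[2], done[2];             // copy finished / compute finished per buffer
    for (int b = 0; b < 2; ++b) { cudaEventCreate(&ready[b]); cudaEventCreate(&done[b]); }

    for (int i = 0; i < nblocks; ++i) {
        // Keep block i resident while its partner blocks stream through the double buffer.
        cudaMemcpyAsync(d_block_i, h_blocks[i], block_bytes,
                        cudaMemcpyHostToDevice, copy_stream);
        cudaStreamSynchronize(copy_stream);

        for (int j = i; j < nblocks; ++j) {
            int b = j & 1;
            cudaStreamWaitEvent(copy_stream, done[b], 0);     // buffer no longer in use
            cudaMemcpyAsync(d_partner[b], h_blocks[j], block_bytes,
                            cudaMemcpyHostToDevice, copy_stream);
            cudaEventRecord(ready[b], copy_stream);

            cudaStreamWaitEvent(compute_stream, ready[b], 0); // data has arrived
            // Enqueue all column-pair regressions between d_block_i and d_partner[b]
            // on compute_stream here (e.g. the solves sketched above). They overlap
            // with the copy of the next partner block issued on copy_stream.
            cudaEventRecord(done[b], compute_stream);
        }
    }
    cudaStreamSynchronize(compute_stream);

    for (int b = 0; b < 2; ++b) { cudaEventDestroy(ready[b]); cudaEventDestroy(done[b]); }
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
    cudaFree(d_block_i); cudaFree(d_partner[0]); cudaFree(d_partner[1]);
}
```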
