
Implementing large linear regression models using CUDA

For the analysis of 10^6 genetic factors and their GeneXGene interactions (~5x10^11 pairs), I have numerous independent linear regression problems which are probably suitable for analysis on GPUs.

The objective is to exhaustively search for GeneXGene interaction effects in modulating an outcome variable (a brain phenotype) using linear regression with the interaction term included.

As far as I know, Householder QR factorization could be the method of choice for fitting the regression models. However, given that each regression matrix in this particular work is roughly 10'000x10 in size, even a single regression matrix does not seem to fit in GPU on-chip memory (shared memory, registers, etc.).

Should I accept this as a problem which is inherently bandwidth-limited and keep the matrices in GPU global memory during regression analysis, or are other strategies possible?

EDIT: Here are more details about the problem:

There will be approximately 10'000 subjects, each with ~1M genetic parameters (genetic matrix: 10'000x10^6). In each iteration, the algorithm should select two columns of this genetic matrix (10'000x2) plus about 6 other variables unrelated to the genetic data (age, gender, etc.), so the final regression model will deal with a matrix of size 10'000x[2 (genetic factors) + 6 (covariates) + 2 (intercept & interaction term)] and an outcome variable vector (10'000x1). This same process will be repeated ~5e11 times, each time with a different pair of genetic factors. Models passing a predefined statistical threshold should be saved as output.
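For concreteness, here is a minimal sketch (assuming column-major, double-precision storage; the kernel and variable names are only illustrative) of how one such 10'000x10 design matrix could be filled on the GPU from a pair of gene columns plus the fixed covariates:

```cpp
// Minimal sketch: build one n x 10 design matrix X in column-major order.
// Columns: [intercept | gene_i | gene_j | gene_i*gene_j (interaction) | 6 covariates]
// d_genes points at the genetic matrix (n x 1e6, column-major), d_cov at the n x 6 covariates.
__global__ void build_design_matrix(const double* d_genes, const double* d_cov,
                                    int n, int gene_i, int gene_j, double* d_X)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    double gi = d_genes[(size_t)gene_i * n + row];
    double gj = d_genes[(size_t)gene_j * n + row];

    d_X[0 * (size_t)n + row] = 1.0;      // intercept
    d_X[1 * (size_t)n + row] = gi;       // first genetic factor
    d_X[2 * (size_t)n + row] = gj;       // second genetic factor
    d_X[3 * (size_t)n + row] = gi * gj;  // GeneXGene interaction term
    for (int c = 0; c < 6; ++c)          // age, gender and the other covariates
        d_X[(size_t)(4 + c) * n + row] = d_cov[(size_t)c * n + row];
}
```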

The specific problem is that although there are ~5e11 separate regression models, even a single one does not seem to fit in on-chip memory.

I also suspect that sticking with CUDA libraries may not be the solution here, as that would mandate that most of the data manipulation take place on the CPU side, with only each individual QR decomposition being sent to the GPU?

Your whole data matrix (1e4 x 1e6) may be too large to fit in global memory, while each of your least-squares solves (1e4 x 10) may be too small to fully utilize the GPU.


For each least squares problem, you could use cuSolver for QR factorization and triangular solving.

http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-geqrf
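A minimal sketch of that path for a single n x p model, loosely following the least-squares pattern from the cuSolver documentation (geqrf, ormqr, then a cuBLAS triangular solve); error handling is omitted and the function name and calling convention are placeholders:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

// Sketch: solve min ||X*beta - y|| for one n x p model already in device memory
// (d_X column-major with leading dimension n, d_y of length n). Error checks omitted.
void solve_one_model(cusolverDnHandle_t solver, cublasHandle_t blas,
                     double* d_X, double* d_y, int n, int p)
{
    double* d_tau;  cudaMalloc(&d_tau,  sizeof(double) * p);
    int*    d_info; cudaMalloc(&d_info, sizeof(int));

    // Workspace large enough for both geqrf and ormqr
    int lwork_qr = 0, lwork_mq = 0;
    cusolverDnDgeqrf_bufferSize(solver, n, p, d_X, n, &lwork_qr);
    cusolverDnDormqr_bufferSize(solver, CUBLAS_SIDE_LEFT, CUBLAS_OP_T,
                                n, 1, p, d_X, n, d_tau, d_y, n, &lwork_mq);
    int lwork = lwork_qr > lwork_mq ? lwork_qr : lwork_mq;
    double* d_work; cudaMalloc(&d_work, sizeof(double) * lwork);

    // Householder QR in place: X = Q * R
    cusolverDnDgeqrf(solver, n, p, d_X, n, d_tau, d_work, lwork, d_info);

    // y := Q^T * y
    cusolverDnDormqr(solver, CUBLAS_SIDE_LEFT, CUBLAS_OP_T,
                     n, 1, p, d_X, n, d_tau, d_y, n, d_work, lwork, d_info);

    // Solve the p x p upper-triangular system R * beta = (Q^T y)(1:p) in place;
    // the first p entries of d_y then hold the regression coefficients.
    const double one = 1.0;
    cublasDtrsm(blas, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                p, 1, &one, d_X, n, d_y, n);

    cudaFree(d_work); cudaFree(d_info); cudaFree(d_tau);
}
```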

If the problem size is too small to fully utilize the GPU, you could use concurrent kernel execution to solve multiple equations at the same time.

https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
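One way this concurrency could look (illustrative only; the stream count and the helper calls inside the loop are assumptions): create a few streams, bind a cuSolver and cuBLAS handle to each with cusolverDnSetStream/cublasSetStream, and round-robin the gene pairs over them so several small solves are in flight at once:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

// Illustrative only: spread many small, independent solves over a few streams so the
// per-model kernels and cuSolver calls can execute concurrently on the GPU.
void solve_batch_concurrently(long pairs_in_batch)
{
    const int NSTREAMS = 8;
    cudaStream_t       streams[NSTREAMS];
    cusolverDnHandle_t solvers[NSTREAMS];
    cublasHandle_t     blases [NSTREAMS];

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamCreate(&streams[s]);
        cusolverDnCreate(&solvers[s]);
        cublasCreate(&blases[s]);
        cusolverDnSetStream(solvers[s], streams[s]);  // bind cuSolver work to this stream
        cublasSetStream(blases[s], streams[s]);       // bind cuBLAS work to this stream
    }

    for (long pair = 0; pair < pairs_in_batch; ++pair) {
        int s = pair % NSTREAMS;
        // Enqueue asynchronous work for this gene pair on stream s, e.g.:
        //   build_design_matrix<<<grid, block, 0, streams[s]>>>(...);
        //   solve_one_model(solvers[s], blases[s], d_X[s], d_y[s], n, p);
        (void)s;
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s) {
        cublasDestroy(blases[s]);
        cusolverDnDestroy(solvers[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```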


For the whole data matrix, if it cannot fit into global memory, you could work on only part of it at a time. For example, you could divide the matrix into ten (1e4 x 1e5) blocks; each time, load two of the blocks over PCIe, form all possible two-column combinations drawn from the two blocks, solve the corresponding equations, and then load another two blocks. Maximizing the block size will help you minimize the PCIe data transfer. With a proper design, I'm sure the time for PCIe data transfer will be much smaller than the time spent solving ~5e11 equations. Furthermore, you could overlap the data transfers with the solver kernel executions.

https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
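A rough sketch of that blocking scheme with overlapped transfers, assuming the genetic matrix has been pre-split into column blocks held in pinned host memory (all names, the double-buffering layout, and the event handshake are illustrative, not a fixed recipe):

```cpp
#include <cuda_runtime.h>

// Rough sketch of overlapping PCIe transfers with solver work. The genetic matrix is
// assumed to be pre-split into nblocks column blocks (n x block_cols doubles each) held
// in pinned host memory, so cudaMemcpyAsync can run asynchronously with computation.
void process_all_block_pairs(double* const h_blocks[], int nblocks,
                             int n, int block_cols)
{
    const size_t block_bytes = sizeof(double) * (size_t)n * block_cols;

    double *d_block_i, *d_partner[2];          // one resident block + a double buffer
    cudaMalloc(&d_block_i, block_bytes);
    cudaMalloc(&d_partner[0], block_bytes);
    cudaMalloc(&d_partner[1], block_bytes);

    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    cudaEvent_t ready[2], done[2];             // copy finished / compute finished per buffer
    for (int b = 0; b < 2; ++b) { cudaEventCreate(&ready[b]); cudaEventCreate(&done[b]); }

    for (int i = 0; i < nblocks; ++i) {
        // Keep block i resident while its partner blocks stream through the double buffer.
        cudaMemcpyAsync(d_block_i, h_blocks[i], block_bytes,
                        cudaMemcpyHostToDevice, copy_stream);
        cudaStreamSynchronize(copy_stream);

        for (int j = i; j < nblocks; ++j) {
            int b = j & 1;
            cudaStreamWaitEvent(copy_stream, done[b], 0);     // buffer no longer in use
            cudaMemcpyAsync(d_partner[b], h_blocks[j], block_bytes,
                            cudaMemcpyHostToDevice, copy_stream);
            cudaEventRecord(ready[b], copy_stream);

            cudaStreamWaitEvent(compute_stream, ready[b], 0); // data has arrived
            // Enqueue all column-pair regressions between d_block_i and d_partner[b]
            // on compute_stream here (e.g. the solves sketched above). They overlap
            // with the copy of the next partner block issued on copy_stream.
            cudaEventRecord(done[b], compute_stream);
        }
    }
    cudaStreamSynchronize(compute_stream);

    for (int b = 0; b < 2; ++b) { cudaEventDestroy(ready[b]); cudaEventDestroy(done[b]); }
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
    cudaFree(d_block_i); cudaFree(d_partner[0]); cudaFree(d_partner[1]);
}
```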
