
Solving n linear systems efficiently

I have n (very large) independent linear systems Ax = b_i. They all share the same A, but the right-hand side b_i differs for i = 1, ..., n. I want to solve these n systems in parallel in CUDA.

I was thinking it might be most efficient to do the LU factorization of A on the host and then copy the factored A to the GPU's constant memory (because even if I did the LU on the device, only one thread would do it while the other threads sat idle; besides, constant memory is faster). Is there a better way to do this?
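To make the "factor once on the host, reuse for every right-hand side" idea concrete, here is a minimal plain-Python sketch (Doolittle LU without pivoting, assumed adequate for a well-conditioned A). In an actual CUDA pipeline the factorization and solves would come from a library (e.g. cuSOLVER's getrf/getrs routines); this only illustrates the data flow.

```python
# Sketch: LU-factor A a single time, then reuse L and U for every b_i.
# Assumes A needs no pivoting (illustration only, not production code).

def lu_factor(A):
    """Doolittle LU decomposition: A = L @ U, with unit-diagonal L."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1.0
        for j in range(i, n):          # row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):      # column i of L
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

def lu_solve(L, U, b):
    """Solve L y = b (forward substitution), then U x = y (backward)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

A = [[4.0, 3.0], [6.0, 3.0]]
L, U = lu_factor(A)                  # factor once ...
bs = [[10.0, 12.0], [7.0, 9.0]]      # ... then solve for each b_i
xs = [lu_solve(L, U, b) for b in bs]
```

The factorization is O(n^3) but happens once; each additional right-hand side costs only the two O(n^2) triangular solves, which is exactly why sharing A across all systems pays off.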

Another issue is that while all threads are solving their systems at the same time with the same algorithm, they all access the same memory location (A[i]) simultaneously, which is not coalesced. How can I optimize this?

(This assumes A is a stably invertible n×n matrix.)

Don't solve a much harder problem just because it seems to parallelize better

Let B be the matrix whose columns are b_1, ..., b_n. Under our assumptions about A, you actually need to solve the single equation AX = B for an n×n matrix of unknowns X, i.e. your solution is X = A^{-1}B.
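A small plain-Python sketch of this reformulation (Gauss-Jordan inversion with partial pivoting, then a matrix product; on a GPU both steps would come from a library such as cuBLAS/cuSOLVER rather than hand-written loops):

```python
# Sketch: stack the b_i as columns of B, then X = A^{-1} B solves all
# n systems at once; column j of X is the solution of A x = b_j.

def invert(A):
    """Gauss-Jordan inversion with partial pivoting (assumes A invertible)."""
    n = len(A)
    # augmented matrix [A | I]
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(M[r][col]))  # pivot row
        M[col], M[p] = M[p], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]  # right half is A^{-1}

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[4.0, 3.0], [6.0, 3.0]]
B = [[10.0, 7.0],   # column j of B is b_j
     [12.0, 9.0]]
X = matmul(invert(A), B)
```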

So basically you have one matrix inversion and one matrix multiplication. This holds regardless of what software and hardware you use. For the inversion and multiplication, just use cuBLAS, cuSPARSE, cuSOLVER, or ArrayFire, or whatever solves these things the fastest.

You could do both of them together, I suppose, but I'm not sure there are optimized routines for that.
