
Parallelize least squares for large (> 30k x 30k) non-square dense matrices

Let RG = A for dense, unstructured matrices with (roughly) these shapes: R : 30k x 40k with float32 entries, G : 40k x 50k with entries either 0.0 or 1.0 (roughly equally often), and consequently A : 30k x 50k with float32 entries.

Given A and G , I want to find the least squares solution for R .

I can use hundreds of CPU cores, hundreds of GB of RAM and also an A40 GPU. What is the best way to use such resources to solve the problem? I'm using Julia 1.7 in the examples below but I'm open to other options!

First question: Can I somehow exploit that the entries of G are only zeros and ones?

Trying to use Julia LinearAlgebra with many CPUs

I've tried two methods: "Penrose inverse" and "right division"

using LinearAlgebra
@show BLAS.get_num_threads()
# defaults to 8. Can change using BLAS.set_num_threads(N)

# build toy problem (order of magnitude smaller sizes)
R_true = rand(Float32, 3_000, 4_000)
G = rand(Float32[0, 1], 4_000, 5_000)
# Float32 literals so that A stays Float32 (Float64 literals would promote A and slow everything down)
# note: using true/false here gives the same results but is much slower!
A = R_true * G

# solve toy problem using matrix (right) division
R_fitted_rdiv = A / G

# solve toy problem using Penrose inverse
R_fitted_pinv = (pinv(G') * A')'
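For reference, the same two approaches can be cross-checked at toy scale in NumPy (a sketch with sizes shrunk further; `numpy` assumed available, names are mine):

```python
import numpy as np

# toy problem, mirroring the Julia snippet above at a much smaller scale
rng = np.random.default_rng(0)
R_true = rng.standard_normal((30, 40)).astype(np.float32)
G = rng.integers(0, 2, (40, 50)).astype(np.float32)
A = R_true @ G

# "right division" A / G  <=>  least-squares solve of G' X = A', then transpose
R_rdiv = np.linalg.lstsq(G.T, A.T, rcond=None)[0].T
# Penrose-inverse route, mirroring (pinv(G') * A')'
R_pinv = (np.linalg.pinv(G.T) @ A.T).T
```

Both routes should agree and recover R_true up to float32 round-off, which is a useful sanity check before scaling up.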

First, setting BLAS.set_num_threads(64) (or any bigger number) actually results in BLAS.get_num_threads() returning 32. Apparently that's an upper limit. Second,

using 32 BLAS threads is actually slower than using 8.

(E.g. performing right division with sizes (4000, 9800) / (8500, 9800) takes less than 50 seconds on 8 threads but more than 55 seconds on 32 threads. I ran things multiple times to exclude compilation-time issues.) I don't know why this happens or whether it's normal. How can I make use of my computing power for this problem?

I think matrix division is faster than the Penrose-inverse method. Is that expected? I don't know exactly what either function does for these inputs. The docs say that left division ( \ ) uses pivoted QR factorization. I couldn't find which algorithm(s) pinv or right division ( / ) use (though / is probably the same as \ , since they are related by transposing the matrices). I'd rather not delve too deeply, because my knowledge of numerical linear algebra is quite limited.

The issue is that for my large matrices either method takes forever. Is there a way to make use of my ~100 cores somehow?

Trying to use the GPU:

Using CUDA.jl , matrices of size around 10k work fine, and pinv takes about a minute:

using CUDA
@time matrix = CUDA.rand(Float32, 10_000, 10_500) # 0.003037 seconds (5 allocations: 160 bytes)
@time pinv(matrix) #  57.417559 seconds (678 allocations: 172.094 KiB)

However, when I try matrices around size 20k, I immediately get the error InexactError: trunc(Int32, 4811456640) . I assume this is due to CUBLAS using Int32 for indexing, even though I don't understand why it leads to an error in this case. (Edit: it's about the size of the array in bytes fitting into 31 bits.)

Trying to use right division with CuArray s gives the error "DimensionMismatch("LU factored matrix A must be square!")". I guess I have to choose a different algorithm manually? I don't know what it's called. (Although, it probably would still crash for large matrices...?)

To summarize, it doesn't look like I can use the GPU from Julia easily to solve my problem. Should I keep trying to use the GPU for this task or stick to the many CPUs?
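One library-independent workaround for the 31-bit size limit is to avoid ever issuing one huge operation: compute G⁺ once (e.g. via the 40k x 40k Gram matrix, or on the CPU), then stream row blocks of A through the GPU, so no single GEMM exceeds the limit. A CPU-only NumPy sketch of the blocking idea (block size here is hypothetical; in the real setup each block multiply would run on the GPU):

```python
import numpy as np

rng = np.random.default_rng(2)
R_true = rng.standard_normal((30, 40)).astype(np.float32)
G = rng.integers(0, 2, (40, 50)).astype(np.float32)
A = R_true @ G

# Factor once: the least-squares solution is R = A * pinv(G)
Gp = np.linalg.pinv(G)   # 50k x 40k in the real problem; computed once, off-GPU if needed
block = 8                # hypothetical block size; choose so each product stays under the byte limit
R = np.vstack([A[i:i + block] @ Gp for i in range(0, A.shape[0], block)])
```

Each block is an independent GEMM, so this also parallelizes trivially across devices or CPU threads.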

Yes, this is really my problem; please refrain from commenting "nobody should ever need such large least squares".

Naive answer

Using PyTorch, this will require at least 30 GB of GPU memory:

import torch

# NB: names here are swapped relative to the question: this "A" plays the role of G, and this "G" the role of A.
A = torch.randint(0, 2, (50000, 40000), device='cuda', dtype=torch.float32).T
G = torch.randint(0, 2, (50000, 30000), device='cuda', dtype=torch.float32).T
# torch.lstsq(B, A) is deprecated/removed; torch.linalg.lstsq(A, B) is the replacement
R = torch.linalg.lstsq(A.T, G.T).solution

If the system can sustain the same operation throughput as my laptop you should have an answer in about 15 minutes.

I would suggest trying a generalized version, scaling up the dimensions, to get a better feeling for how your system will handle it:

def try_it(a, b, c):
    A = torch.randint(0, 2, (a, b), device='cuda', dtype=torch.float32).T
    G = torch.randint(0, 2, (a, c), device='cuda', dtype=torch.float32).T
    R = torch.linalg.lstsq(A.T, G.T).solution

I transposed the dimensions in the generation to make sure G.T and A.T would be contiguous.

You can't take much advantage of the entries being integer. This type of problem is easier to solve over the reals than over the integers: finding integer solutions would require a search, whereas the real least-squares solution can be found by direct algebraic manipulation.
