
Determining optimal block size for blocked matrix multiplication

I am trying to implement blocked (tiled) matrix multiplication on a single processor. I have read the literature on why blocking improves memory performance, but I wanted to ask how to determine the optimal block size. I need to perform C = C + A*B, where A, B, and C are floating-point square matrices of the same dimension. It makes sense that three blocks (one each from A, B, and C) should fit in the cache at once, so should the block size be the cache size divided by 3? Or should the block size be something else?

Finally, can anyone suggest a viable experimental way to determine the optimal block size on the supercomputer I am working with? I am using C with GCC.

I am trying to implement blocked (tiled) matrix multiplication on a single processor.

Notice that in 2021 most processors are multi-core. You might be interested in POSIX pthreads; see pthreads(7).

I need to perform C = C + A*B, where A, B, and C are floating-point square matrices of the same dimension. It makes sense that three blocks (one each from A, B, and C) should fit in the cache at once, so should the block size be the cache size divided by 3?

I am not an expert, but I don't think it is that simple. CPU cache sizes are usually powers of two, and you have more than one cache level (typically L1, L2, and L3), each with a different size and latency. A common starting heuristic is to choose the block size so that the three working blocks together fit in the L1 data cache, then tune from there by measurement.
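To make that heuristic concrete, here is a minimal sketch in C. The 32 KiB L1 data-cache size is an assumption for illustration only; query the real value on your machine (see below).

    /* A minimal sketch of the "three blocks fit in L1" heuristic.
       L1_SIZE = 32 KiB is an ASSUMPTION for illustration; query the
       real value on your machine. For BS x BS blocks of doubles, one
       block from each of A, B and C must fit:
           3 * BS * BS * sizeof(double) <= L1_SIZE
       With L1_SIZE = 32768 this gives BS <= 36, so BS = 32 is a
       natural power-of-two starting point; only measurement on the
       actual machine can confirm it is best. */
    #include <math.h>

    #define L1_SIZE 32768 /* assumed L1 data cache size, in bytes */

    static int candidate_block_size(void)
    {
        return (int) sqrt((double) L1_SIZE / (3.0 * sizeof(double)));
    }

In practice the best block size also depends on cache associativity, TLB behavior, and how well the compiler vectorizes the inner loop, which is why the experimental approach described below matters.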

Read about BLAS and consider using it; its DGEMM routine computes exactly this C = C + A*B operation, and good implementations are already blocked and tuned for the target machine.
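For instance, a sketch using the standard CBLAS interface; which library and link flag to use depends on what your supercomputer provides (e.g. OpenBLAS or Intel MKL):

    /* A sketch of doing C = C + A*B with BLAS instead of hand-written
       code, assuming a library exposing the standard CBLAS interface
       (OpenBLAS, ATLAS, Intel MKL, ...). Link with e.g. -lopenblas. */
    #include <cblas.h>

    void add_product(double *C, const double *A, const double *B, int n)
    {
        /* DGEMM computes C := alpha*A*B + beta*C; alpha = beta = 1.0
           gives exactly the accumulation asked for. Row-major square
           matrices, so all leading dimensions are n. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 1.0, C, n);
    }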

Finally, can anyone suggest a viable experimental way to determine the optimal block size on the supercomputer I am working with?

I assume that the supercomputer runs Linux, that you can compile C code on it with some GCC, and that you can execute the result and dlopen(3) it if it is compiled as a plugin. Read Drepper's paper How to write shared libraries for details.

Then, after reading time(7), you could write a C program (inspired by my manydl.c) which generates several different temporary C files, each defining the same function with a different block size; compiles each /tmp/generated1234.c file, using system(3), with gcc -O3 -Wall -shared -fPIC /tmp/generated1234.c -o /tmp/generated1234.so; dlopen(3)s that "/tmp/generated1234.so"; dlsym(3)s the generated function; calls it through a pointer; and measures the CPU time of each such plugin.
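Here is a rough sketch of that generate-compile-load-measure loop, assuming Linux with glibc (link the driver with -ldl on older glibc). The function name matmul_blocked and the file-naming scheme are made up for illustration:

    /* Sketch of the plugin-benchmarking idea. ASSUMPTIONS: Linux, GCC,
       and a generated function named matmul_blocked (a made-up name).
       Compile this driver with: gcc -O2 -Wall driver.c -ldl */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <dlfcn.h>

    typedef void (*matmul_fn)(double *, const double *, const double *, int);

    /* Generate, compile, load and time one plugin for block size bs.
       Returns the measured CPU time in seconds. */
    double bench_block_size(int bs, int n, double *C,
                            const double *A, const double *B)
    {
        char src[64], lib[64], cmd[256];
        snprintf(src, sizeof src, "/tmp/generated%d.c", bs);
        snprintf(lib, sizeof lib, "/tmp/generated%d.so", bs);

        /* Emit a C file in which the block size is a compile-time
           constant, so gcc -O3 can fully optimize the inner loops. */
        FILE *f = fopen(src, "w");
        if (!f) { perror(src); exit(EXIT_FAILURE); }
        fprintf(f,
            "#define BS %d\n"
            "void matmul_blocked(double *C, const double *A,\n"
            "                    const double *B, int n) {\n"
            "  for (int ii = 0; ii < n; ii += BS)\n"
            "    for (int kk = 0; kk < n; kk += BS)\n"
            "      for (int jj = 0; jj < n; jj += BS)\n"
            "        for (int i = ii; i < ii+BS && i < n; i++)\n"
            "          for (int k = kk; k < kk+BS && k < n; k++)\n"
            "            for (int j = jj; j < jj+BS && j < n; j++)\n"
            "              C[i*n+j] += A[i*n+k] * B[k*n+j];\n"
            "}\n", bs);
        fclose(f);

        snprintf(cmd, sizeof cmd,
                 "gcc -O3 -Wall -shared -fPIC %s -o %s", src, lib);
        if (system(cmd) != 0) { fprintf(stderr, "compile failed\n"); exit(EXIT_FAILURE); }

        void *h = dlopen(lib, RTLD_NOW);
        if (!h) { fprintf(stderr, "%s\n", dlerror()); exit(EXIT_FAILURE); }
        matmul_fn fn = (matmul_fn) dlsym(h, "matmul_blocked");
        if (!fn) { fprintf(stderr, "%s\n", dlerror()); exit(EXIT_FAILURE); }

        struct timespec t0, t1;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
        fn(C, A, B, n);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
        dlclose(h);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

Loop this over candidate block sizes (e.g. 8, 16, 24, 32, 48, 64, 96, 128), repeat each measurement a few times on matrices large enough to exceed the last-level cache, and keep the fastest.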

I need to perform C+A*B where A, B, C are floating-point square matrices of the same dimension.

Alternatively, some supercomputers have OpenCL (or CUDA) implementations. You could learn OpenCL (or CUDA) and write some critical numerical kernel routines in it, or generate OpenCL (or CUDA) code just as you would generate C code.

Of course you want a recent GCC, e.g. GCC 10 in spring 2021, and you probably want to read about all the possible optimization flags, including those enabling OpenACC and OpenMP.
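For example, here is a minimal sketch combining blocking with OpenMP; BS = 32 is an assumed block size that you would tune as described above:

    /* A minimal sketch combining blocking with OpenMP. BS = 32 is an
       ASSUMED block size, to be tuned experimentally.
       Compile with: gcc -O3 -fopenmp matmul_omp.c */
    #define BS 32

    void matmul_blocked_omp(double *C, const double *A,
                            const double *B, int n)
    {
        /* Parallelize the outermost block loop: each thread updates a
           distinct band of rows of C, so no two threads ever write the
           same element and no synchronization is needed. */
        #pragma omp parallel for schedule(static)
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < ii+BS && i < n; i++)
                        for (int k = kk; k < kk+BS && k < n; k++)
                            for (int j = jj; j < jj+BS && j < n; j++)
                                C[i*n+j] += A[i*n+k] * B[k*n+j];
    }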

I would even guess that you might use machine-learning techniques to find the optimal block size.

Read also about Open MPI.

Be aware of /proc/cpuinfo, documented in proc(5); it reports, among other things, the cache size of each core.
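On a glibc-based Linux system you can also query cache sizes programmatically instead of parsing /proc/cpuinfo by hand; the _SC_LEVEL*_CACHE_SIZE names below are glibc extensions and may report 0 where the information is unavailable:

    /* A small sketch, assuming Linux with glibc: query the data-cache
       sizes via sysconf(3). These _SC_ names are glibc extensions. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);
        long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
        long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);
        printf("L1d: %ld bytes, L2: %ld bytes, L3: %ld bytes\n",
               l1, l2, l3);
        return 0;
    }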

You could also contact other supercomputer users, both in your country and elsewhere. Weather-forecasting organizations (in France, MeteoFrance) and engineers doing CAD in various industries (automotive, defense, aerospace, ...) come to mind; so do CERN (or even my employer, CEA), people from ITER (in Europe), and LLNL (in the USA).
