
Determining optimal block size for blocked matrix multiplication

I am trying to implement blocked (tiled) matrix multiplication on a single processor. I have read the literature on why blocking improves memory performance, but I wanted to ask how to determine the optimal block size. I need to perform C = C + A*B, where A, B, and C are floating-point square matrices of the same dimension. It makes sense that three blocks (one each from A, B, and C) should fit in the cache at once, so should the block size be the cache size divided by 3? Or should the block size be something else?

Finally, can anyone suggest a viable experimental way to determine the optimal block size on the supercomputer I am working with? I am using C with GCC.

I am trying to implement blocked (tiled) matrix multiplication on a single processor.

Notice that in 2021 most processors are multi-core. You might be interested in POSIX pthreads; see pthreads(7).

I need to perform C = C + A*B, where A, B, and C are floating-point square matrices of the same dimension. It makes sense that three blocks (one each from A, B, and C) should fit in the cache at once, so should the block size be the cache size divided by 3?

I am not an expert, but I don't think it is that simple. CPU cache sizes are usually powers of two, and you have more than one cache level (typically L1, L2, and L3), each with a different size and latency. A common starting heuristic is to choose the block size so that the three working blocks together fit in the L1 data cache, then tune from there by measurement.
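To make that heuristic concrete, here is a minimal sketch in C. The 32 KiB L1 data-cache size is an assumption for illustration only; query the real value on your machine (see below).

    /* A minimal sketch of the "three blocks fit in L1" heuristic.
       L1_SIZE = 32 KiB is an ASSUMPTION for illustration; query the
       real value on your machine. For BS x BS blocks of doubles, one
       block from each of A, B and C must fit:
           3 * BS * BS * sizeof(double) <= L1_SIZE
       With L1_SIZE = 32768 this gives BS <= 36, so BS = 32 is a
       natural power-of-two starting point; only measurement on the
       actual machine can confirm it is best. */
    #include <math.h>

    #define L1_SIZE 32768 /* assumed L1 data cache size, in bytes */

    static int candidate_block_size(void)
    {
        return (int) sqrt((double) L1_SIZE / (3.0 * sizeof(double)));
    }

In practice the best block size also depends on cache associativity, TLB behavior, and how well the compiler vectorizes the inner loop, which is why the experimental approach described below matters.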

Read about BLAS and consider using it; its DGEMM routine computes exactly this C = C + A*B operation, and good implementations are already blocked and tuned for the target machine.
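For instance, a sketch using the standard CBLAS interface; which library and link flag to use depends on what your supercomputer provides (e.g. OpenBLAS or Intel MKL):

    /* A sketch of doing C = C + A*B with BLAS instead of hand-written
       code, assuming a library exposing the standard CBLAS interface
       (OpenBLAS, ATLAS, Intel MKL, ...). Link with e.g. -lopenblas. */
    #include <cblas.h>

    void add_product(double *C, const double *A, const double *B, int n)
    {
        /* DGEMM computes C := alpha*A*B + beta*C; alpha = beta = 1.0
           gives exactly the accumulation asked for. Row-major square
           matrices, so all leading dimensions are n. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 1.0, C, n);
    }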

Finally, can anyone suggest a viable experimental way to determine the optimal block size on the supercomputer I am working with?

I assume that the supercomputer runs Linux, that you can compile C code on it with some GCC, and that you can execute the result and dlopen(3) it if it is compiled as a plugin. Read Drepper's paper How to write shared libraries for details.

Then, after reading time(7), you could write a C program (inspired by my manydl.c) which generates several different temporary C files, each defining the same function with a different block size; compiles each /tmp/generated1234.c file, using system(3), with gcc -O3 -Wall -shared -fPIC /tmp/generated1234.c -o /tmp/generated1234.so; dlopen(3)s that "/tmp/generated1234.so"; dlsym(3)s the generated function; calls it through a pointer; and measures the CPU time of each such plugin.
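Here is a rough sketch of that generate-compile-load-measure loop, assuming Linux with glibc (link the driver with -ldl on older glibc). The function name matmul_blocked and the file-naming scheme are made up for illustration:

    /* Sketch of the plugin-benchmarking idea. ASSUMPTIONS: Linux, GCC,
       and a generated function named matmul_blocked (a made-up name).
       Compile this driver with: gcc -O2 -Wall driver.c -ldl */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <dlfcn.h>

    typedef void (*matmul_fn)(double *, const double *, const double *, int);

    /* Generate, compile, load and time one plugin for block size bs.
       Returns the measured CPU time in seconds. */
    double bench_block_size(int bs, int n, double *C,
                            const double *A, const double *B)
    {
        char src[64], lib[64], cmd[256];
        snprintf(src, sizeof src, "/tmp/generated%d.c", bs);
        snprintf(lib, sizeof lib, "/tmp/generated%d.so", bs);

        /* Emit a C file in which the block size is a compile-time
           constant, so gcc -O3 can fully optimize the inner loops. */
        FILE *f = fopen(src, "w");
        if (!f) { perror(src); exit(EXIT_FAILURE); }
        fprintf(f,
            "#define BS %d\n"
            "void matmul_blocked(double *C, const double *A,\n"
            "                    const double *B, int n) {\n"
            "  for (int ii = 0; ii < n; ii += BS)\n"
            "    for (int kk = 0; kk < n; kk += BS)\n"
            "      for (int jj = 0; jj < n; jj += BS)\n"
            "        for (int i = ii; i < ii+BS && i < n; i++)\n"
            "          for (int k = kk; k < kk+BS && k < n; k++)\n"
            "            for (int j = jj; j < jj+BS && j < n; j++)\n"
            "              C[i*n+j] += A[i*n+k] * B[k*n+j];\n"
            "}\n", bs);
        fclose(f);

        snprintf(cmd, sizeof cmd,
                 "gcc -O3 -Wall -shared -fPIC %s -o %s", src, lib);
        if (system(cmd) != 0) { fprintf(stderr, "compile failed\n"); exit(EXIT_FAILURE); }

        void *h = dlopen(lib, RTLD_NOW);
        if (!h) { fprintf(stderr, "%s\n", dlerror()); exit(EXIT_FAILURE); }
        matmul_fn fn = (matmul_fn) dlsym(h, "matmul_blocked");
        if (!fn) { fprintf(stderr, "%s\n", dlerror()); exit(EXIT_FAILURE); }

        struct timespec t0, t1;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
        fn(C, A, B, n);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
        dlclose(h);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

Loop this over candidate block sizes (e.g. 8, 16, 24, 32, 48, 64, 96, 128), repeat each measurement a few times on matrices large enough to exceed the last-level cache, and keep the fastest.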

I need to perform C+A*B where A, B, C are floating-point square matrices of the same dimension.

Alternatively, some supercomputers have OpenCL (or CUDA) implementations. You could learn OpenCL (or CUDA) and write some critical numerical kernel routines in it, or generate OpenCL (or CUDA) code just as you would generate C code.

Of course you want a recent GCC, e.g. GCC 10 in spring 2021, and you probably want to read about all the possible optimization flags, including those enabling OpenACC and OpenMP.
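For example, here is a minimal sketch combining blocking with OpenMP; BS = 32 is an assumed block size that you would tune as described above:

    /* A minimal sketch combining blocking with OpenMP. BS = 32 is an
       ASSUMED block size, to be tuned experimentally.
       Compile with: gcc -O3 -fopenmp matmul_omp.c */
    #define BS 32

    void matmul_blocked_omp(double *C, const double *A,
                            const double *B, int n)
    {
        /* Parallelize the outermost block loop: each thread updates a
           distinct band of rows of C, so no two threads ever write the
           same element and no synchronization is needed. */
        #pragma omp parallel for schedule(static)
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < ii+BS && i < n; i++)
                        for (int k = kk; k < kk+BS && k < n; k++)
                            for (int j = jj; j < jj+BS && j < n; j++)
                                C[i*n+j] += A[i*n+k] * B[k*n+j];
    }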

I would even guess that you might use machine-learning techniques to find the optimal block size.

Read also about Open MPI.

Be aware of /proc/cpuinfo, documented in proc(5); it reports, among other things, the cache size of each core.
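On a glibc-based Linux system you can also query cache sizes programmatically instead of parsing /proc/cpuinfo by hand; the _SC_LEVEL*_CACHE_SIZE names below are glibc extensions and may report 0 where the information is unavailable:

    /* A small sketch, assuming Linux with glibc: query the data-cache
       sizes via sysconf(3). These _SC_ names are glibc extensions. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);
        long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
        long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);
        printf("L1d: %ld bytes, L2: %ld bytes, L3: %ld bytes\n",
               l1, l2, l3);
        return 0;
    }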

You could also contact other supercomputer users, both in your country and elsewhere. Weather-forecasting organizations (in France, MeteoFrance) and engineers doing CAD in various industries (automotive, defense, aerospace, ...) come to mind; so do CERN (or even my employer, CEA), people from ITER (in Europe), and LLNL (in the USA).
