
Determining optimal block size for blocked matrix multiplication

Original post: 2021-01-06 06:52:24 | Tags: c, memory-management, matrix-multiplication, cpu-cache, cache-locality

I am trying to implement blocked (tiled) matrix multiplication on a single processor. I have read the literature on why blocking improves memory performance, but I wanted to ask how to determine the optimal block size. I need to perform C+A*B, where A, B, C are floating-point square matrices of the same dimension. It makes sense that 3 blocks should fit in the cache at once, so should the block size be the cache size divided by 3? Or should the block size be something else?

Finally, can anyone suggest a viable experimental way to determine the optimal block size on the supercomputer I am working with? I am working with GCC C.

1 answer

I am trying to implement blocked (tiled) matrix multiplication on a single processor.

Note that in 2021 most processors are multi-core. You might be interested in POSIX pthreads. See pthreads(7).

I need to perform C+A*B where A, B, C are floating-point square matrices of the same dimension. It makes sense that 3 blocks should fit in the cache at once, so should the block size be the cache size divided by 3?

I am not an expert, but I don't think it is that simple. CPU cache size is often some power of 2, and you have more than one cache level.
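To make the arithmetic in the question concrete, here is a minimal sketch, not a tuned implementation. It assumes row-major double matrices, reads C+A*B as the update C += A*B, and takes "three blocks fit in cache" literally: each b-by-b tile gets a third of the cache, so the tile edge is about sqrt(cache_bytes / (3 * sizeof(double))) rather than the cache size divided by 3. The names candidate_block_edge and matmul_blocked are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Largest b such that three b-by-b tiles of elem_size-byte elements
   fit in cache_bytes. Only a starting guess for an experimental sweep. */
size_t candidate_block_edge(size_t cache_bytes, size_t elem_size)
{
    assert(elem_size > 0);
    size_t per_tile = cache_bytes / (3 * elem_size); /* elements per tile */
    size_t b = 0;
    while ((b + 1) * (b + 1) <= per_tile)            /* integer square root */
        b++;
    return b;
}

/* C += A * B for row-major n-by-n matrices, processed in bs-by-bs tiles.
   Edge tiles are clipped, so bs does not have to divide n. */
void matmul_blocked(size_t n, size_t bs,
                    const double *A, const double *B, double *C)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t imax = ii + bs < n ? ii + bs : n;
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t kmax = kk + bs < n ? kk + bs : n;
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t jmax = jj + bs < n ? jj + bs : n;
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
        }
    }
}
```

For a 32 KiB cache and 8-byte doubles, candidate_block_edge(32768, 8) gives 36. Treat that as the centre of a range to measure, not as the answer: cache lines, associativity and the other cache levels all affect which size actually runs fastest.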

Read about BLAS and consider using it.

Finally, can anyone suggest a viable experimental way to determine the optimal block size on the supercomputer I am working with?

I assume that the supercomputer runs Linux, and that you can compile C code on it with some GCC, execute it, and, if it is compiled as a plugin, dlopen(3) it. Read Drepper's paper "How To Write Shared Libraries" for details.

Then, after reading time(7), you could write some C program (inspired by my manydl.c) which generates various temporary C files defining C functions that use different block sizes. It would compile, using system(3), some /tmp/generated1234.c file with gcc -O3 -Wall -shared -fPIC /tmp/generated1234.c -o /tmp/generated1234.so, then dlopen(3) that "/tmp/generated1234.so", dlsym(3) these C functions, call them through pointers, and measure the CPU time of each such plugin.
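A rough sketch of that generate, compile, load and time loop could look like the following. It is not manydl.c; the names write_kernel_source, time_block_size and matmul_bs, the /tmp/generated<bs>.c naming, and the all-ones test matrices are assumptions made for illustration. CPU time is taken with clock_gettime(2) and CLOCK_PROCESS_CPUTIME_ID.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime under strict -std= modes */
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef void (*matmul_fn)(size_t, const double *, const double *, double *);

/* Write a C file defining matmul_bs(), a blocked C += A*B with the block
   size baked in as a compile-time constant. Returns 0 on success. */
int write_kernel_source(const char *path, int bs)
{
    assert(bs > 0);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f,
        "#include <stddef.h>\n"
        "#define BS %d\n"
        "void matmul_bs(size_t n, const double *A, const double *B, double *C)\n"
        "{\n"
        "  for (size_t ii = 0; ii < n; ii += BS)\n"
        "  for (size_t kk = 0; kk < n; kk += BS)\n"
        "  for (size_t jj = 0; jj < n; jj += BS)\n"
        "    for (size_t i = ii; i < ii + BS && i < n; i++)\n"
        "    for (size_t k = kk; k < kk + BS && k < n; k++)\n"
        "    for (size_t j = jj; j < jj + BS && j < n; j++)\n"
        "      C[i * n + j] += A[i * n + k] * B[k * n + j];\n"
        "}\n", bs);
    return fclose(f) == 0 ? 0 : -1;
}

/* Generate, compile, load and time one kernel on n-by-n matrices.
   Returns CPU seconds, or -1.0 if any step fails. */
double time_block_size(int bs, size_t n)
{
    char src[64], so[64], cmd[256];
    assert(n > 0);
    snprintf(src, sizeof src, "/tmp/generated%d.c", bs);
    snprintf(so, sizeof so, "/tmp/generated%d.so", bs);
    snprintf(cmd, sizeof cmd, "gcc -O3 -Wall -shared -fPIC %s -o %s", src, so);
    if (write_kernel_source(src, bs) != 0 || system(cmd) != 0)
        return -1.0;

    void *handle = dlopen(so, RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1.0;
    }
    matmul_fn fn = (matmul_fn)dlsym(handle, "matmul_bs");
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = calloc(n * n, sizeof *C);
    double secs = -1.0;
    if (fn && A && B && C) {
        for (size_t i = 0; i < n * n; i++)
            A[i] = B[i] = 1.0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
        fn(n, A, B, C);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
        if (C[0] == (double)n) /* sanity check: all-ones product is n */
            secs = (double)(t1.tv_sec - t0.tv_sec)
                 + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
    }
    free(A);
    free(B);
    free(C);
    dlclose(handle);
    return secs;
}
```

A driver would call time_block_size for each candidate (say 8, 16, 32, 64, ...) with a realistic n, repeat each measurement several times, and keep the fastest. On older glibc versions the driver has to be linked with -ldl.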

I need to perform C+A*B where A, B, C are floating-point square matrices of the same dimension.

Alternatively, some supercomputers have OpenCL (or CUDA) implementations. You could learn OpenCL (or CUDA) and code some critical numerical kernel routines in it, or generate OpenCL (or CUDA) code like you would generate C code.

Of course you want a recent GCC, e.g. GCC 10 in spring 2021. You probably also want to read about all the possible optimization flags, including OpenACC and OpenMP.
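As one illustration of what OpenMP could add, here is a hedged sketch of a blocked kernel whose outer tile loop is shared among threads. The function name matmul_blocked_omp and the fixed BS of 32 are assumptions, not measured choices. The pragma only takes effect when compiling with -fopenmp; without that flag GCC ignores it and the code runs serially.

```c
#include <assert.h>
#include <stddef.h>

#define BS 32 /* hypothetical tile edge; tune it by measurement */

/* C += A * B for row-major n-by-n matrices. Each thread handles whole
   row-blocks of C, so no two threads write the same element. */
void matmul_blocked_omp(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    #pragma omp parallel for schedule(static)
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                for (size_t i = ii; i < ii + BS && i < n; i++)
                    for (size_t k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Something like gcc -O3 -march=native -fopenmp would be a reasonable first set of flags to compare against plain -O3. Note that the best block size with several threads may differ from the single-threaded one, because cores often share the outer cache levels.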

I would even guess that you might use machine learning techniques to find the optimal block size.

Read also about Open-MPI.

Be aware of /proc/cpuinfo, documented in proc(5).
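For instance, on many x86 Linux systems /proc/cpuinfo contains a line such as "cache size : 8192 KB". A small sketch for extracting it follows; the function names are hypothetical, and the line is absent on some architectures, in which case the functions return -1.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one /proc/cpuinfo line such as "cache size\t: 8192 KB".
   Returns the size in KB, or -1 if this is not a "cache size" line. */
long parse_cache_size_kb(const char *line)
{
    long kb;
    assert(line != NULL);
    if (sscanf(line, "cache size : %ld KB", &kb) == 1)
        return kb;
    return -1;
}

/* Return the first "cache size" value found in the given file
   (normally "/proc/cpuinfo"), or -1 if none is found. */
long first_cache_size_kb(const char *path)
{
    char line[512];
    long kb = -1;
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (kb < 0 && fgets(line, sizeof line, f))
        kb = parse_cache_size_kb(line);
    fclose(f);
    return kb;
}
```

Which cache level that number describes depends on the CPU, so it is only a hint. Per-level sizes are usually listed under /sys/devices/system/cpu/cpu0/cache/.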

You could also contact other supercomputer users, both in your country and elsewhere. Weather forecasting organizations (in France, MeteoFrance) or engineers doing CAD in various industries (automotive, defense, aerospace, ...) come to mind. Or CERN (or even my employer CEA), or people from ITER (in Europe) or LLNL (in the USA).