
Multi-threaded GEMM slower than single-threaded one?

I wrote some naive GEMM code and I am wondering why it is much slower than the equivalent single-threaded GEMM code.

With a 200x200 matrix: single-threaded 7 ms, multi-threaded 108 ms. CPU: 3930K, 12 threads in the thread pool.

    template <unsigned M, unsigned N, unsigned P, typename T>
    static Matrix<M, P, T> multiply( const Matrix<M, N, T> &lhs, const Matrix<N, P, T> &rhs, ThreadPool & pool )
    {
        Matrix<M, P, T> result = {0};

        Task<void> task(pool);
        for (auto i=0u; i<M; ++i)
            for (auto j=0u; j<P; j++)
                task.async([&result, &lhs, &rhs, i, j](){
                    T sum = 0;
                    for (auto k=0u; k < N; ++k)
                        sum += lhs[i * N + k] * rhs[k * P + j];
                    result[i * P + j] = sum; // note: i * P, not i * M — result is an M×P matrix in row-major order
            });

        task.wait();

        return std::move(result);
    }

I do not have experience with GEMM, but your problem seems to be related to issues that appear in all kinds of multi-threading scenarios.

When using multi-threading, you introduce a couple of potential overheads, the most common of which are usually:

  1. creation/cleanup of threads at start/end
  2. context switches when (number of threads) > (number of CPU cores)
  3. locking of resources, waiting to obtain a lock
  4. cache synchronization issues

Items 2 and 3 probably don't play a role in your example: you are using 12 threads on 12 (hyper-threaded) cores, and your algorithm does not involve locks.

However, item 1 might be relevant in your case: you are creating a total of 40,000 tasks, each of which multiplies and adds just 200 values. I'd suggest trying less fine-grained threading, maybe splitting only after the first loop. It's always a good idea not to split the problem into pieces smaller than necessary.
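A coarser split might look like the sketch below: one task per block of rows instead of one per output element. It uses plain `std::thread` and flat `std::array`s rather than the asker's `Matrix`/`ThreadPool` types (which aren't shown in full), so the names here are illustrative.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <thread>
#include <vector>

// Row-block parallel GEMM sketch: split after the outer (row) loop.
template <unsigned M, unsigned N, unsigned P, typename T>
std::array<T, M * P> multiply_rows(const std::array<T, M * N>& lhs,
                                   const std::array<T, N * P>& rhs,
                                   unsigned num_threads)
{
    std::array<T, M * P> result{};
    std::vector<std::thread> workers;
    const unsigned rows_per_thread = (M + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const unsigned begin = t * rows_per_thread;
        const unsigned end   = std::min(M, begin + rows_per_thread);
        if (begin >= end)
            break;
        // Each thread owns a contiguous block of whole rows, so its writes
        // never share a cache line with another thread's writes (except
        // possibly right at the block boundaries).
        workers.emplace_back([&lhs, &rhs, &result, begin, end] {
            for (unsigned i = begin; i < end; ++i)
                for (unsigned j = 0; j < P; ++j) {
                    T sum = 0;
                    for (unsigned k = 0; k < N; ++k)
                        sum += lhs[i * N + k] * rhs[k * P + j];
                    result[i * P + j] = sum;
                }
        });
    }
    for (auto& w : workers)
        w.join();
    return result;
}
```

With 12 threads and M = 200, each thread handles a block of about 17 rows, so you launch 12 tasks instead of 40,000.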

Item 4 will also very likely be important in your case. While you're not running into a race condition when writing the results to the array (because every thread writes to its own index position), you are very likely to provoke a large overhead of cache syncs.

"Why?" “为什么?” you might think, because you're writing to different places in memory. 你可能会想,因为你在写作记忆的不同地方。 That's because a typical CPU cache is organized in cache lines, which on the current Intel and AMD CPU models are 64 bytes long. 这是因为典型的CPU缓存是在缓存行中组织的,当前的Intel和AMD CPU型号都是64字节长。 This is the smallest size that can be used for transfers from and to the cache, when something is changed. 当更改某些内容时,这是可用于从缓存传输到缓存的最小大小。 Now that all CPU cores are reading and writing to adjacent memory words, this leads to synchronization of 64 bytes between all the cores whenever you are writing just 4 bytes (or 8, depending on the size of the data type you're using). 既然所有CPU内核都在读取和写入相邻的存储器字,那么只要你只写4个字节(或8个,取决于你正在使用的数据类型的大小),这就会导致所有内核之间的64字节同步。

If memory is not an issue, you can simply "pad" every output array element with "dummy" data so that there is only one output element per cache line. If you're using 4-byte data types, this would mean skipping 15 array elements for each real data element. The cache issues will also improve when you make your threading less fine-grained, because every thread will access its own contiguous region in memory practically without interfering with other threads' memory.
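A minimal sketch of the padding idea: align each output slot to its own 64-byte cache line, so two threads writing neighbouring logical elements never touch the same line. The 64-byte figure matches the Intel/AMD line size discussed above; C++17's `std::hardware_destructive_interference_size` reports it portably where the compiler supports it.

```cpp
#include <cassert>

// One payload value per 64-byte cache line; the compiler inserts the
// remaining 60 bytes of padding to satisfy the alignment requirement.
struct alignas(64) PaddedFloat {
    float value;
};

static_assert(sizeof(PaddedFloat) == 64, "one element per cache line");
```

An array of `PaddedFloat` trades 16x the memory for write traffic that never ping-pongs a line between cores.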

Edit: A more detailed description by Herb Sutter (one of the C++ gurus) can be found here: http://www.drdobbs.com/parallel/maximize-locality-minimize-contention/208200273

Edit 2: BTW, it's suggested to avoid `std::move` in the return statement, as it might get in the way of the return-value-optimization and copy-elision rules, which the standard now requires to happen automatically. See: Is returning with `std::move` sensible in the case of multiple return statements?
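A minimal illustration of that point (function names are made up for the example):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returning the local by name leaves the compiler free to apply NRVO
// (and C++11 already guarantees a move at worst).
std::vector<int> make_nrvo_friendly() {
    std::vector<int> v(100, 1);
    return v;                  // copy elision / implicit move
}

// Wrapping the name in std::move is never needed here and actively
// disables NRVO: the compiler must perform the move it is told to.
std::vector<int> make_pessimized() {
    std::vector<int> v(100, 1);
    return std::move(v);       // still correct, but blocks NRVO
}
```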

Multi-threading always means synchronization, context switching, and function calls. This all adds up and costs CPU cycles that you could spend on the main task itself.

If you just use a third nested loop instead, you save all these steps and can do the computation inline rather than in a subroutine, where you must set up a stack, call into it, switch to a different thread, return the result, and switch back to the main thread.

Multi-threading is only useful if these costs are small compared to the main task. I guess you will see better results with multi-threading when the matrix is larger than 200x200.

In general, multi-threading is well suited to tasks that take a lot of time, preferably because of computational complexity rather than device access. The loop you showed us takes too little time to execute for it to be effectively parallelized.

You have to remember that there is a lot of overhead in thread creation. There is also some (but significantly less) overhead in synchronization.
