C++ multi-thread slower than single thread

I have some code like this:

class MyTask {
public:
    void run(size_t pool_size) {
        ... // do some pre things
        std::vector<std::string> name_list=read_a_list(); // read task list

        std::vector<std::pair<std::string, double>> result_list; // name & time

        boost::asio::thread_pool pool(pool_size); // "pool_size" threads in pool
        size_t max_task=2*pool_size;        // max "2*pool_size" tasks in queue
        size_t task_number=0;               // using task_number to limit the number of tasks in queue
        boost::mutex task_number_mu;
        boost::mutex result_list_mu;        // protects result_list
        boost::condition_variable_any task_number_condition;

        for(size_t idx=0;idx<name_list.size();++idx){
             boost::unique_lock<boost::mutex> out_lock(task_number_mu);
             task_number_condition.wait(out_lock, [&] {
                 return task_number < max_task;
                 });
             ++task_number;
             boost::asio::post(pool,
                  [&,idx] {
                      {
                          boost::unique_lock<boost::mutex> in_lock(task_number_mu);
                          --task_number;
                          task_number_condition.notify_one();
                      }
                      std::string name=name_list[idx];
                      Timer timer; // a class using std::chrono to collect time
                      timer.Start();

                      A a=read_A_data(name+"_a.csv"); // one file
                      timer.Stop();
                      double time_a=timer.Elapsed();

                      B b=read_B_data(name+"_b"); // many files in "name_b" directory
                      timer.Stop();
                      double time_b=timer.Elapsed();

                      result_type result=do_some_selection(a,b); // very expensive function
                      timer.Stop();
                      double time_r=timer.Elapsed();

                      write_result(result,name+"_result.csv"); // one file
                      timer.Stop();
                      double time_w=timer.Elapsed();

                      ... // output idx, time_{a,b,r,w} by boost log

                      {
                           boost::lock_guard<boost::mutex> lock(result_list_mu);
                           result_list.emplace_back(std::make_pair(name,time_w));
                      }
                });//post
           }//for
      pool.join();
      ... // do some other things
   } //run

public :
   static A read_A_data(const std::string& name_a){
         ... // read "name_a" file, less than 1.5M 
   }
   static B read_B_data(const std::string& name_b){
         ... // read files in "name_b" directory, more than 10 files, 1M~10M per file
   }
   static result_type do_some_selection(A a,B b){
         result_type result;
         for(const auto& i:b){
              for(const auto& j:a){
                   if(b_fit_in_a(i,j)){ //b_fit_in_a() does not have loops
                       result.emplace_back(i);
                   }//if
              }//for j
         }//for i
         return result;
   }
   static void write_result(const result_type& result, const std::string& name_r){
         ... // write result to "name_r", about 2M~15M
   }
};

When I set pool_size to 1 (single thread), the time output looks like this:

1 14.7845 471.214 1491.16 1927.86
2 4.247 649.694 1327 1523.7
3 5.4375 924.407 2852.44 3276.1
4 4.1798 754.361 1078.97 1187.15
5 5.4944 1284.37 2935.02 3336.19
6 5.515 694.369 2825.79 3380.3
...

I have a Xeon-W with 16 cores and 32 threads (16C32T), so I set pool_size to 8 and got:

1 14.7919 2685.21 6600.4 7306.15
2 16.0127 2311.94 10517.2 12044.3
3 7.4403 2111.83 6210.49 7014.61
4 9.0292 2165.12 10482.5 11893
5 16.6851 1664.2 17282.7 20489.9
6 32.9876 6488.17 25730.6 25744.7
...

With pool_size set to 16:

1 22.5189 5324.67 18018.6 20386
2 17.1096 8670.3 21245.8 23229.1
3 17.9065 10930.7 27335.3 29961.55
4 20.6321 5227.19 30733 34926
5 25.104 2372.04 13810.9 15916.7
6 39.6723 18734.3 79300.1 79393.5
...

With pool_size set to 32:

1 39.3981 19159.7 43451.7 44527.1
2 51.1908 5693.48 43391.3 50314.4
3 42.4458 18068.6 59520.6 67359.4
4 44.1195 29214.7 70312.4 76902
5 64.1733 23071.1 86055.2 86146.7
6 44.1062 36277.5 89474.4 98104.7
...

I understand that multithreaded programs often run into disk read/write contention, which explains the increase in time_a, time_b and time_w. What confuses me is that time_r increased a lot as well. do_some_selection is a static member function, so I don't think the threads interact, yet the more threads I use, the longer each task takes. What did I do wrong? How can I make this kind of task parallel?

First, display the data in a sensible manner. As it stands, it is hard to make any assessment. For example, print the time differences, so we can easily see how much time each stage took rather than how much time had passed since the task began.
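A minimal sketch of what printing time differences could look like. LapTimer is a hypothetical helper, not the asker's Timer class, and the choice of milliseconds is an assumption, since the question does not state its units.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not the asker's Timer class): each call to lap()
// records the time since the previous lap, not since construction.
class LapTimer {
public:
    using Clock = std::chrono::steady_clock;

    LapTimer() : last_(Clock::now()) {}

    // Store the duration of the stage that just finished, in milliseconds
    // (assumed unit), and restart the clock for the next stage.
    double lap(const std::string& stage) {
        const Clock::time_point now = Clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(now - last_).count();
        last_ = now;
        laps_.emplace_back(stage, ms);
        return ms;
    }

    // All recorded (stage name, duration) pairs in call order.
    const std::vector<std::pair<std::string, double>>& laps() const {
        return laps_;
    }

    // Sum of all recorded stages, i.e. the cumulative time of the task.
    double total() const {
        double sum = 0.0;
        for (const auto& entry : laps_) sum += entry.second;
        return sum;
    }

private:
    Clock::time_point last_;
    std::vector<std::pair<std::string, double>> laps_;
};
```

In the task body this would be used as `double time_a = timer.lap("read_a");` after each stage. Applied by hand to the first single-thread row, and assuming the columns are cumulative as this answer reads them, the per-stage times would be 14.78 for read_A_data, 456.43 for read_B_data, 1019.95 for do_some_selection and 436.70 for write_result.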

Second, the tasks you run are mostly disk reads and writes, which do not parallelize well, so the total execution time will not change by much. The total time to finish a batch of unrelated tasks is about the same as it would be on a single thread. However, because you run multiple threads, the tasks compete for resources, which delays each task's completion until most of the tasks are done.

As for why the unrelated, computation-only part slows down: this depends a lot on the computation you perform. Not much can be said as it stands beyond some generic possible reasons. From the looks of it, you perform some memory manipulation. RAM access is limited by the memory bus and is generally slow. In the single-threaded case, much of the data may still sit in the processor's cache, which considerably reduces the time needed to process it. But this is only a general guess at the cause. You ought to do a deeper analysis to find the bottleneck; on PC processors the memory bus should be more than sufficient for multiple threads.
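A first step in such an analysis could be to run a compute-only workload on 1, 8, 16 and 32 threads with no disk access at all and compare the mean time per task. The sketch below assumes a stand-in workload. selection_like_work is hypothetical and only imitates the shape of do_some_selection (scan the input, append matches to a growing vector), so its numbers will not match the real function.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Stand-in workload (an assumption, not the asker's do_some_selection):
// walks a buffer and appends matching elements to a growing vector, so it
// exercises both memory reads and heap allocation.
std::size_t selection_like_work(std::size_t n) {
    std::vector<int> input(n);
    std::iota(input.begin(), input.end(), 0);
    std::vector<int> result;
    for (int v : input) {
        if (v % 3 == 0) result.emplace_back(v);
    }
    return result.size();
}

// Run the same workload on `threads` threads at once and return the mean
// wall-clock time per task in milliseconds.
double mean_task_ms(unsigned threads, std::size_t n) {
    assert(threads > 0);
    std::vector<double> elapsed(threads, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&elapsed, t, n] {
            const auto start = std::chrono::steady_clock::now();
            // volatile keeps the compiler from discarding the work.
            volatile std::size_t sink = selection_like_work(n);
            (void)sink;
            const auto stop = std::chrono::steady_clock::now();
            elapsed[t] =
                std::chrono::duration<double, std::milli>(stop - start).count();
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(elapsed.begin(), elapsed.end(), 0.0) / threads;
}
```

If `mean_task_ms(1, n)` and `mean_task_ms(16, n)` are close, the compute stage scales and the slowdown in the real program more likely comes from elsewhere, such as I/O or the data being copied. If the mean grows with the thread count, memory bandwidth, cache pressure or allocator contention become the more plausible suspects.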
