
Why is traversing more time-consuming than merging on two sorted std::list?

I am pretty surprised by the result that traversing two sorted std::list takes around 12% more time than merging them. Merging can be viewed, and implemented, as a series of element comparisons, list splices, and iterator traversal over the two separate sorted linked lists. Hence, traversing should not be slower than merging, especially when the two lists are large enough, because the ratio of iterated elements keeps increasing.
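To make that picture concrete, here is a minimal sketch (an illustration, not the actual standard-library implementation; the function name merge_by_splice is made up) of a merge written purely in terms of comparisons, splice, and iterator advances:

#include <list>

// Illustration only: merge the sorted list `b` into the sorted list `a`.
// Every step is either a comparison, a one-node splice, or an iterator advance.
void merge_by_splice(std::list<int>& a, std::list<int>& b)
{
    auto it = a.begin();
    while (!b.empty()) {
        if (it == a.end() || b.front() < *it)
            a.splice(it, b, b.begin());  // move one node of b in front of it
        else
            ++it;                        // same work a plain traversal would do
    }
}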

However, the result does not seem to match what I thought, and this is how I tested the ideas above:

#include <chrono>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <list>

int main()
{
    std::list<int> list1, list2;

    for (int cnt = 0; cnt < 1 << 22; cnt++)
        list1.push_back(rand());
    for (int cnt = 0; cnt < 1 << 23; cnt++)
        list2.push_back(rand());

    list1.sort();
    list2.sort();

    auto start = std::chrono::system_clock::now();  // C++ wall clock

    // Choose either one option below (comment out the other)
    list1.merge(list2);         // Option 1: merge list2 into list1
    for (auto num : list1);     // Option 2: traverse both lists
    for (auto num : list2);     // Option 2

    std::chrono::duration<double> diff = std::chrono::system_clock::now() - start;
    std::cout << std::setprecision(9) << "\n       "
              << diff.count() << " seconds (measured)" << std::endl;  // show elapsed time
}

PS: icc is smart enough to eliminate Option 2 entirely. To prevent that, use sum += num; inside the loops and print out sum.
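For example, a small change like the following (just a sketch; the variable sum is only there to defeat dead-code elimination) keeps the traversal loops alive:

long long sum = 0;
for (auto num : list1) sum += num;  // the loads now have an observable effect
for (auto num : list2) sum += num;
std::cout << sum << std::endl;      // printing sum stops the compiler from removing the loops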

This is the output from perf (the measured time stays the same without perf):

Option 1: Merge

       0.904575206 seconds (measured)

 Performance counter stats for './option-1-merge':

    33,395,981,671      cpu-cycles
       149,371,004      cache-misses              #   49.807 % of all cacherefs
       299,898,436      cache-references
    24,254,303,068      cycle-activity.stalls-ldm-pending    

       7.678166480 seconds time elapsed

Option 2: Traverse

       1.01401903 seconds (measured)

 Performance counter stats for './option-2-traverse':

    33,844,645,296      cpu-cycles
       138,723,898      cache-misses             #   48.714 % of all cacherefs
       284,770,796      cache-references
    25,141,751,107      cycle-activity.stalls-ldm-pending

       7.806018949 seconds time elapsed

Because of the terrible spatial locality of these linked lists, cache misses are the main reason the CPU stalls, and they eat up most of the CPU's resources. The strange point is that Option 2 has fewer cache misses than Option 1, yet it needs more CPU stalls and more CPU cycles to get its work done. What makes this anomaly happen?

As you know, it is memory that is taking all your time.

Cache misses are bad, but so are stalls.

From this paper:

Applications with irregular memory access patterns, e.g., dereferencing chains of pointers when traversing linked lists or trees, may not generate enough concurrently outstanding requests to fully utilize the data paths. Nevertheless, such applications are clearly limited by the performance of memory accesses as well. Therefore, considering the bandwidth utilization is not sufficient to detect all memory related performance issues.

Basically, randomly walking pointers can fail to saturate the memory bandwidth.

The tight loop over each list is blocked on every iteration, waiting for the next pointer to be loaded. If it is not in cache, the CPU can do nothing: it stalls.

The combined tight loop of the merge tries to load two pages into the cache. When one is loading, sometimes the CPU can advance on the other.
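For illustration, here is a minimal sketch (not the measured code; the function names are made up) contrasting the two access patterns: the back-to-back traversal chases one pointer chain at a time, while an interleaved walk, roughly what the merge loop does, keeps two independent chains in flight so their cache misses can overlap:

#include <list>

// One dependent pointer chain at a time: every ++it must wait for the previous
// node's `next` pointer to arrive from memory before the next load can start.
long long traverse_one_after_another(const std::list<int>& a, const std::list<int>& b)
{
    long long sum = 0;
    for (int x : a) sum += x;   // chain 1, fully serialized loads
    for (int x : b) sum += x;   // chain 2, only starts after chain 1 is done
    return sum;
}

// Two independent chains advanced in the same loop: while a miss on one list
// is outstanding, the out-of-order core can already issue the load for the
// other list, overlapping the two memory latencies (memory-level parallelism).
long long traverse_interleaved(const std::list<int>& a, const std::list<int>& b)
{
    long long sum = 0;
    auto ia = a.begin(), ib = b.begin();
    while (ia != a.end() && ib != b.end()) {
        sum += *ia++;
        sum += *ib++;
    }
    for (; ia != a.end(); ++ia) sum += *ia;  // drain whichever list is longer
    for (; ib != b.end(); ++ib) sum += *ib;
    return sum;
}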

The result you measured is that the merge has fewer stalls than the naked, wasted double iteration.

Or in other words,

 24,254,303,068 cycle-activity.stalls-ldm-pending 

is a big number, yet smaller than:

 25,141,751,107 cycle-activity.stalls-ldm-pending 

I am surprised this is enough to make a 10% difference, but that is what measuring with perf is for.
