
Fastest way of finding two minimum int64 elements in array

I have arrays with sizes from 1000 to 10000 elements (1k to 10k). Each element is an int64. My task is to find the two smallest elements of each array: the minimum element and the minimum of the remaining elements.

I want the fastest possible single-threaded C++ code for Intel Core2 or Core i7 (the CPU runs in 64-bit mode).

This function (getting the 2 smallest elements of an array) is the hotspot; it is nested in two or three for loops with huge iteration counts.

The current code looks like this:

void f()
{
    int best; // index of the minimum element
    int64 min_cost = 1LL << 61;
    int64 second_min_cost = 1LL << 62;
    for (int i = 1; i < width; i++) {
        int64 cost = get_ith_element_from_array(i); // it is inlined
        if (cost < min_cost) {
            best = i;
            second_min_cost = min_cost;
            min_cost = cost;
        } else if (cost < second_min_cost) {
            second_min_cost = cost;
        }
    }
    save_min_and_next(min_cost, best, second_min_cost);
}

Look at partial_sort and nth_element.

std::vector<int64_t> arr(10000); // large

std::partial_sort(arr.begin(), arr.begin()+2, arr.end());
// arr[0] and arr[1] are minimum two values

If you only want the second-lowest value, nth_element is your guy.
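As a minimal sketch of the nth_element route (the function name second_smallest is made up for illustration, and the array is assumed to hold at least two elements): after the call, position 1 holds the second-lowest value and position 0 holds a value no greater than it, i.e. the minimum. Like partial_sort, this reorders the array, so the index of the minimum in the original order is lost.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper, not from the question: returns the second-lowest value.
// Takes the vector by value because std::nth_element reorders its input;
// in a hot loop that copy would have to be avoided.
// Assumes arr.size() >= 2.
std::int64_t second_smallest(std::vector<std::int64_t> arr) {
    // Place the element that belongs at sorted position 1 into arr[1].
    std::nth_element(arr.begin(), arr.begin() + 1, arr.end());
    return arr[1];
}
```

The copy is there only to keep the caller's data intact in this sketch; it is not something the answer asks for.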

Try inverting the if:

if (cost < second_min_cost)
{
    if (cost < min_cost)
    {
        // new minimum: the old minimum becomes the second minimum
    }
    else
    {
        // new second minimum only
    }
}

You should also probably initialize min_cost and second_min_cost with the same value, the maximum value of int64 (or, even better, use qbert220's suggestion).
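A minimal sketch of that initialization combined with the inverted if, assuming the costs live in a contiguous buffer (the struct TwoMin and the function two_smallest are illustrative names, not from the question). Starting both values at the int64 maximum removes any assumption about the range of the costs, with one caveat noted in the comments.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Illustrative result type: the two smallest values and the index of the smallest.
struct TwoMin {
    std::int64_t min_cost;
    std::int64_t second_min_cost;
    std::size_t best;
};

TwoMin two_smallest(const std::int64_t* a, std::size_t n) {
    // Both start at the largest int64. Caveat: a cost equal to this maximum
    // never passes the outer test, so it is never recorded.
    const std::int64_t kMax = std::numeric_limits<std::int64_t>::max();
    TwoMin r{kMax, kMax, 0};
    for (std::size_t i = 0; i < n; ++i) {
        const std::int64_t cost = a[i];
        if (cost < r.second_min_cost) {      // the only test most elements see
            if (cost < r.min_cost) {
                r.second_min_cost = r.min_cost;
                r.min_cost = cost;
                r.best = i;
            } else {
                r.second_min_cost = cost;
            }
        }
    }
    return r;
}
```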

Some small things (which may be happening already, but may be worth trying, I guess).

  1. Unroll the loop slightly: for example, iterate in strides of 8 (i.e. one cache line at a time), prefetch the next cache line in the body, then process the 8 items. To avoid lots of checks, make sure the end condition is a multiple of 8 and process the leftover items (fewer than 8) outside the loop, also unrolled.

  2. For the items of no interest, you are doing two checks in the body; maybe you can trim that to one? I.e. if the cost is less than second_min, then check min as well; otherwise there is no need to bother.
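The two suggestions above can be sketched together as follows. This is only an illustration under assumptions: the elements are taken to be in a contiguous buffer, all names (MinPair, consider, two_smallest_unrolled) are invented here, and the prefetch uses the GCC/Clang builtin __builtin_prefetch, so it is compiled out elsewhere. The leftover items are handled with a plain loop rather than a second unrolled copy. Hardware prefetchers usually handle a forward linear scan well on their own, so whether the explicit prefetch helps has to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Illustrative names throughout; none of them come from the question.
struct MinPair {
    std::int64_t min_cost;
    std::int64_t second_min_cost;
    std::size_t best;  // index of the minimum element
};

// Single outer check, as in item 2: an uninteresting element costs one test.
inline void consider(MinPair& r, std::int64_t cost, std::size_t i) {
    if (cost < r.second_min_cost) {
        if (cost < r.min_cost) {
            r.second_min_cost = r.min_cost;
            r.min_cost = cost;
            r.best = i;
        } else {
            r.second_min_cost = cost;
        }
    }
}

MinPair two_smallest_unrolled(const std::int64_t* a, std::size_t n) {
    const std::int64_t kMax = std::numeric_limits<std::int64_t>::max();
    MinPair r{kMax, kMax, 0};
    const std::size_t blocked = n - n % 8;  // largest multiple of 8 that is <= n
    std::size_t i = 0;
    for (; i < blocked; i += 8) {  // 8 x 8 bytes = 64 bytes per iteration
#if defined(__GNUC__)
        // GCC/Clang builtin: hint that the cache line holding the start of
        // the next block will be read soon.
        __builtin_prefetch(a + i + 8);
#endif
        consider(r, a[i], i);
        consider(r, a[i + 1], i + 1);
        consider(r, a[i + 2], i + 2);
        consider(r, a[i + 3], i + 3);
        consider(r, a[i + 4], i + 4);
        consider(r, a[i + 5], i + 5);
        consider(r, a[i + 6], i + 6);
        consider(r, a[i + 7], i + 7);
    }
    for (; i < n; ++i) {  // fewer than 8 leftover items
        consider(r, a[i], i);
    }
    return r;
}
```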

You'd better check second_min_cost first, since it is the only condition that must hold for the result to be modified. This way, the common case (an element that changes nothing) costs one branch in your main loop instead of two. This should help quite a bit.

Other than that, there is very little to optimise; you are already close to optimal. Unrolling may help, but I doubt it will bring any significant advantage in this scenario.

So, it becomes:

void f()
{
    int best; // index of the minimum element
    int64 min_cost = 1LL << 61;
    int64 second_min_cost = 1LL << 62;
    for (int i = 1; i < width; i++) {
        int64 cost = get_ith_element_from_array(i); // it is inlined
        if (cost < second_min_cost)
        {
            if (cost < min_cost)
            {
                best = i;
                second_min_cost = min_cost;
                min_cost = cost;
            }
            else second_min_cost = cost;
        }
    }
    save_min_and_next(min_cost, best, second_min_cost);
}

What you have there is O(n) and optimal for random data. That means you already have the fastest algorithm.

The only way you can improve this is by giving certain properties to your array, for example keeping it sorted at all times or making it a heap.
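To illustrate the heap option, here is a sketch that assumes the costs are kept in a binary min-heap built with std::make_heap and std::greater (the function name is hypothetical). In such a heap the minimum is the root and the second minimum must be one of the root's two children, so both can be read in constant time. Note that positions in the heap are not the original array indices, and that keeping the heap valid costs O(log n) per changed element, so this only pays off if the array changes little between queries.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: reads {minimum, second minimum} from a min-heap.
// Assumes `heap` satisfies the heap property for std::greater (e.g. after
// std::make_heap(heap.begin(), heap.end(), std::greater<std::int64_t>()))
// and holds at least two elements.
std::pair<std::int64_t, std::int64_t>
two_smallest_from_min_heap(const std::vector<std::int64_t>& heap) {
    std::int64_t second = heap[1];  // left child of the root
    if (heap.size() > 2 && heap[2] < second) {
        second = heap[2];           // right child of the root
    }
    return {heap[0], second};
}
```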

The good point is that your algorithm scans the numbers only once. You're optimal.

An important source of slowness could be the way your elements are laid out in memory. If they are in an array, I mean a C array (or C++ vector) where all the elements are contiguous and you scan them forward, then memory-wise you're optimal too. Otherwise, you could have some surprises. For instance, if your elements are in a linked list, or scattered and gathered, you can pay a penalty for memory accesses.

Make sure your array reading is well-behaved so it doesn't introduce needless cache misses.

This code should probably be very close to bandwidth-bound on modern CPUs, assuming the array reading is simple. You need to profile and/or calculate whether it still has any headroom for CPU optimizations.
