Top K smallest selection algorithm - O(n + k log n) vs O(n log k) for k << N

I'm asking this with regard to the Top K algorithm. I'd think that O(n + k log n) should be faster, because, for instance, if you try plugging in k = 300 and n = 100000000, we can see that O(n + k log n) is smaller.

However, when I do a benchmark with C++, it shows me that O(n log k) is more than 2x faster. Here's the complete benchmarking program:

#include <iostream>
#include <vector>
#include <algorithm>
#include <functional>   // greater<>, less<>
#include <iterator>
#include <ctime>
#include <cstdlib>
using namespace std;

int RandomNumber () { return rand(); }
// find_topk: build a min-heap over all n elements in O(n), then pop the smallest k times.
vector<int> find_topk(int arr[], int k, int n)
{
   make_heap(arr, arr + n, greater<int>());

   vector<int> result(k);

   for (int i = 0; i < k; ++i)
   {
      result[i] = arr[0];
      pop_heap(arr, arr + n - i, greater<int>());
   }

   return result;
}

// find_topk2: keep a max-heap of the k smallest seen so far; one pass over the rest.
vector<int> find_topk2(int arr[], int k, int n)
{
   make_heap(arr, arr + k, less<int>());

   for (int i = k; i < n; ++i)
   {
      if (arr[i] < arr[0])
      {
         pop_heap(arr, arr + k, less<int>());
         arr[k - 1] = arr[i];
         push_heap(arr, arr + k, less<int>());
      }
   }

   vector<int> result(arr, arr + k);

   return result;
}


int main()
{
   const int n = 220000000;
   const int k = 300;

   srand (time(0));
   int* arr = new int[n];

   generate(arr, arr + n, RandomNumber);

   // replace with topk or topk2
   vector<int> result = find_topk2(arr, k, n);

   copy(result.begin(), result.end(), ostream_iterator<int>(cout, "\n"));

   delete[] arr;
   return 0;
}

find_topk's approach is to build a heap over all n elements, in O(n), and then remove the top element of the heap k times, at O(log n) each. find_topk2's approach is to build a heap of size k, in O(k), with the max element at the top, and then for each remaining element from k to n, compare it to the top element; if it is smaller, pop the top element and push the new one, which means up to n operations of O(log k). Both approaches are written quite similarly, so I don't believe any implementation detail (like creating temporaries, etc.) can cause a difference besides the algorithm and the dataset (which is random).

I did profile the benchmark, and I could see that find_topk actually called the comparison operator many more times than find_topk2. But I'm more interested in the reasoning about the theoretical complexity, so two questions:

  1. Disregarding the implementation or benchmark, was I wrong to expect that O(n + k log n) should be better than O(n log k)? If I'm wrong, please explain why, and how to reason about it so that I can see that O(n log k) is actually better.
  2. If I'm not wrong about no. 1, then why does my benchmark show otherwise?

Big O in several variables is complex, since you need assumptions about how your variables scale with one another before you can unambiguously take the limit to infinity.

If, e.g., k ~ n^(1/2), then O(n log k) becomes O(n log n) and O(n + k log n) becomes O(n + n^(1/2) log n) = O(n), which is better.

If k ~ log n, then O(n log k) = O(n log log n) and O(n + k log n) = O(n), which is better. Note that log log 2^1024 = 10, so the constants hidden in the O(n) may be greater than log log n for any realistic n.

If k = constant, then O(n log k) = O(n) and O(n + k log n) = O(n), which is the same.
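
Plugging the benchmark's own values into the two expressions makes the comparison concrete. A minimal sketch of that arithmetic (the base-2 logarithm is an arbitrary choice, since big-O hides the base, and all other constant factors are ignored):

#include <cmath>
#include <cstdio>

int main()
{
   const double n = 220000000.0;  // array size from the benchmark above
   const double k = 300.0;        // number of smallest elements wanted

   // Rough operation-count estimates, ignoring constant factors.
   double cost_topk  = n + k * std::log2(n);  // heapify n, then pop k times
   double cost_topk2 = n * std::log2(k);      // worst case: n updates of a size-k heap

   std::printf("n + k log n ~ %.3g\n", cost_topk);   // about 2.2e8
   std::printf("n log k     ~ %.3g\n", cost_topk2);  // about 1.8e9
   return 0;
}

So on paper the first expression is smaller by almost an order of magnitude for these values; the constant factors discussed next are what can still make the measured times go the other way.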

But the constants play a big role: for instance, building a heap over the whole array may involve reading the array 3 times, whereas building a priority queue of length k as you go only requires one pass through the array, and a small constant times log k for the lookup.

Which is "better" is therefore unclear, although my quick analysis tended to show that O(n + k log n) performs better under mild assumptions on k.

For instance, if k is a very small constant (say k = 3), then I'm ready to bet that the make_heap approach performs worse than the priority-queue one on real-world data.

Use asymptotic analysis wisely, and above all, profile your code before drawing conclusions.
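
As a concrete way to do that kind of profiling, here is a minimal, self-contained sketch (not part of the original benchmark; the counter name and the smaller n are assumptions made for brevity) that counts comparisons for both approaches on the same random data:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>

// Hypothetical instrumentation: a global comparison counter, reset between runs.
static long long g_comparisons = 0;

struct CountingLess
{
   bool operator()(int a, int b) const { ++g_comparisons; return a < b; }
};

struct CountingGreater
{
   bool operator()(int a, int b) const { ++g_comparisons; return a > b; }
};

int main()
{
   const int n = 1000000;   // smaller than the original benchmark, for speed
   const int k = 300;

   std::vector<int> a(n);
   std::generate(a.begin(), a.end(), std::rand);
   std::vector<int> b = a;  // same data for both approaches

   // Approach 1: heapify all n elements, then pop the smallest k times.
   g_comparisons = 0;
   std::make_heap(a.begin(), a.end(), CountingGreater());
   for (int i = 0; i < k; ++i)
      std::pop_heap(a.begin(), a.end() - i, CountingGreater());
   std::cout << "find_topk comparisons:  " << g_comparisons << "\n";

   // Approach 2: max-heap of size k, then one pass over the rest.
   g_comparisons = 0;
   std::make_heap(b.begin(), b.begin() + k, CountingLess());
   for (int i = k; i < n; ++i)
   {
      ++g_comparisons;  // count the b[i] < b[0] test as well
      if (b[i] < b[0])
      {
         std::pop_heap(b.begin(), b.begin() + k, CountingLess());
         b[k - 1] = b[i];
         std::push_heap(b.begin(), b.begin() + k, CountingLess());
      }
   }
   std::cout << "find_topk2 comparisons: " << g_comparisons << "\n";

   return 0;
}

On random input this typically reports noticeably fewer comparisons for the size-k heap, matching what the question observed in its own profile.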

You are comparing two worst-case upper bounds. For the first approach, the worst case is pretty much equal to the average case. For the second approach, if the input is random, then once more than a handful of items have passed through the heap, the chance that a new value is thrown away immediately, because it is not going to replace any of the current top K, is pretty high, so the worst-case estimate for it is pessimistic.
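
That intuition can be quantified: for uniformly random input, the i-th element scanned only displaces the heap top if it is among the k smallest of the first i elements, which happens with probability about k/i, so the expected number of heap updates is roughly the sum of k/i for i from k+1 to n, i.e. about k*ln(n/k). A small sketch of that estimate using the benchmark's values (the formula is an expected-value approximation added here, not part of the original answer):

#include <cmath>
#include <cstdio>

int main()
{
   const double n = 220000000.0;  // elements scanned
   const double k = 300.0;        // heap size

   // Expected number of times a random element displaces the heap top:
   //   sum_{i=k+1}^{n} k/i  ~  k * ln(n/k)
   double expected_updates = k * std::log(n / k);

   std::printf("expected heap updates: about %.0f out of %.0f iterations\n",
               expected_updates, n - k);
   return 0;
}

In other words, only a few thousand of the roughly 220 million iterations are expected to touch the heap at all; the rest cost a single comparison each.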

If you are comparing wall-clock time as opposed to comparison counts, you may find that heap-based algorithms with large heaps tend not to win many races, because they have horrible storage locality, and constant factors on modern microprocessors are heavily influenced by which level of memory you end up working in: finding that your data is out in real memory chips (or worse, on disk), and not in some level of cache, will slow you down a lot. Which is a shame, because I really like heapsort.

Keep in mind that you can now use std::nth_element instead of having to use a heap and do things yourself. Since the default comparator is std::less<>(), you can say something like this:

std::nth_element(myList.begin(), myList.begin() + k, myList.end());

Now, the first k positions of myList (0 to k - 1) will hold the k smallest elements.
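
A self-contained sketch of that approach (the data generation, the sizes, and the optional sort of the front are illustrative additions, not part of the original answer):

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>

int main()
{
   const int n = 1000000;  // illustrative size
   const int k = 300;

   std::vector<int> myList(n);
   std::generate(myList.begin(), myList.end(), std::rand);

   // Partition so that the k smallest elements occupy positions 0..k-1,
   // in unspecified order, in average O(n) time.
   std::nth_element(myList.begin(), myList.begin() + k, myList.end());

   // Optional: sort just those k elements if ordered output is needed (O(k log k)).
   std::sort(myList.begin(), myList.begin() + k);

   for (int i = 0; i < k; ++i)
      std::cout << myList[i] << "\n";
   return 0;
}

Note that nth_element only partitions: without the extra sort, the first k elements are the k smallest but in no particular order.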
