
Finding the median of more than 20 million values drawn from 3 to 4 distinct integers in 1.5 seconds

I am trying to sort a sequence of integers that contains only 3 to 4 distinct values and find its median.

I am dealing with about 20 to 25 million numbers. Each time a new integer is added to the vector, I am supposed to sort the vector, find the median, and add that median to a separate "Total" variable that accumulates every median generated so far.

1                   Median: 1              Total: 1
1 , 2               Median: (1+2)/2 = 1    Total: 1 + 1 = 2
1 , 2 , 3           Median: 2              Total: 2 + 2 = 4
1 , 1 , 2 , 3       Median: (1+2)/2 = 1    Total: 4 + 1 = 5
1 , 1 , 1 , 2 , 3   Median: 1              Total: 5 + 1 = 6
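
To pin down the expected result, here is a deliberately naive reference sketch that reproduces the table above. It is far too slow for 20 million values (each insertion into the sorted vector is linear) and is only meant for checking faster versions on small inputs. The function name sumOfRunningMediansNaive is hypothetical, and the integer division for even counts is an assumption read off the examples, where (1+2)/2 = 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical reference helper: keeps a sorted copy of everything seen so far
// and adds the median of that copy to the total after each insertion.
long long sumOfRunningMediansNaive(const std::vector<int>& values)
{
  std::vector<int> sorted;
  long long total = 0;

  for (int value : values)
  {
    // Insert at the position that keeps the vector sorted.
    sorted.insert(std::upper_bound(sorted.begin(), sorted.end(), value), value);

    const std::size_t n = sorted.size();
    if (n % 2 == 1)
      total += sorted[n / 2];
    else // Assumption: integer division, as in the question's examples.
      total += (static_cast<long long>(sorted[n / 2 - 1]) + sorted[n / 2]) / 2;
  }
  return total;
}
```

Feeding it the values in the order 1, 2, 3, 1, 1 walks through the five rows of the table.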

I am trying to optimize my code further because it is not efficient enough (it has to finish in under 2 s or so). Does anyone have an idea how to speed up my logic?

I am currently using 2 heaps (priority queues in C++), one functioning as a max-heap and the other as a min-heap.

I got the idea from the answer to "Data structure to find median":

You can use 2 heaps, that we will call Left and Right.
Left is a Max-Heap.
Right is a Min-Heap.
Insertion is done like this:

If the new element x is smaller than the root of Left, then we insert x into Left.
Else we insert x into Right.
If after insertion Left has more than one element more than Right, then we call Extract-Max on Left and insert the result into Right.
Else if after insertion Right has more elements than Left, then we call Extract-Min on Right and insert the result into Left.
The median is always the root of Left.

So insertion is done in O(lg n) time and getting the median is done in O(1) time.

But it is just not fast enough...
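
For reference, the two-heap scheme quoted above could be sketched roughly as follows. This is not the asker's actual code, which is not shown. The class name TwoHeapMedian is hypothetical, and for an even number of elements the sketch averages the two roots with integer division to match the examples in the question, whereas the quoted description simply takes the root of Left.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical sketch of the two-heap running median.
class TwoHeapMedian
{
public:
  void add(int x)
  {
    // Values no larger than the current maximum of the lower half go left.
    if (left_.empty() || x <= left_.top())
      left_.push(x);
    else
      right_.push(x);

    // Rebalance so that left_ holds as many elements as right_, or one more.
    if (left_.size() > right_.size() + 1)
    {
      right_.push(left_.top());
      left_.pop();
    }
    else if (right_.size() > left_.size())
    {
      left_.push(right_.top());
      right_.pop();
    }
  }

  // Precondition: add() has been called at least once.
  int median() const
  {
    if (left_.size() > right_.size())
      return left_.top(); // Odd count: the single middle element.
    // Even count: assumption, integer division as in the question's examples.
    return static_cast<int>((static_cast<long long>(left_.top()) + right_.top()) / 2);
  }

private:
  std::priority_queue<int> left_;                                       // max-heap, lower half
  std::priority_queue<int, std::vector<int>, std::greater<int>> right_; // min-heap, upper half
};
```

Each add() costs O(log n), so n insertions cost O(n log n) in total, which is the cost the answers below try to avoid.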

If you only ever have three to four distinct integers in the string, you can just keep track of how many times each one appears by traversing the string once. Adding elements to this representation (and removing them from it) is also doable in constant time.

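#include <cstddef>
#include <map>
#include <vector>

class MedianFinder
{
public:
  explicit MedianFinder(const std::vector<int>& inputString)
  {
    for (int element : inputString)
      addStringEntry(element);
  }

  void addStringEntry(int entry)
  {
    _counts[entry]++; // operator[] inserts 0 first if entry is not in the map.
    _size++;
  }

  // Returns 0 if nothing has been added yet. For an even number of elements
  // the two middle values are averaged with integer division, as in the
  // question's examples.
  int getMedian() const
  {
    if (_size == 0)
      return 0;

    // 0-based positions of the middle element(s) in sorted order;
    // both are equal when the number of elements is odd.
    const size_t lowerIndex = (_size - 1) / 2;
    const size_t upperIndex = _size / 2;

    size_t cumulativeCount = 0;
    int lowerValue = 0;
    bool lowerFound = false;
    for (const auto& kvp : _counts) // std::map iterates in ascending key order.
    {
      cumulativeCount += kvp.second;
      if (!lowerFound && cumulativeCount > lowerIndex)
      {
        lowerValue = kvp.first;
        lowerFound = true;
      }
      if (cumulativeCount > upperIndex)
        return static_cast<int>((static_cast<long long>(lowerValue) + kvp.first) / 2);
    }
    return 0; // Not reached for a non-empty container.
  }

private:
  std::map<int, size_t> _counts; // distinct value -> number of occurrences
  size_t _size = 0;              // total number of elements added
};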

The trivial task of summing the medians is not shown here.
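
As a sketch of that remaining step, the function below applies the same counting idea and accumulates the median after every insertion. The name sumOfRunningMedians is hypothetical, and the integer division for even counts is an assumption taken from the question's examples. With only 3 to 4 distinct values the inner loop visits at most 3 to 4 buckets, so the whole pass is linear in the number of inputs.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: sums the median obtained after each insertion.
long long sumOfRunningMedians(const std::vector<int>& values)
{
  std::map<int, std::size_t> counts; // distinct value -> occurrences, in ascending order
  std::size_t size = 0;
  long long total = 0;

  for (int value : values)
  {
    ++counts[value];
    ++size;

    // 0-based positions of the one or two middle elements in sorted order.
    const std::size_t lowerIndex = (size - 1) / 2;
    const std::size_t upperIndex = size / 2;

    std::size_t cumulative = 0;
    long long lowerValue = 0;
    bool lowerFound = false;
    for (const auto& kvp : counts)
    {
      cumulative += kvp.second;
      if (!lowerFound && cumulative > lowerIndex)
      {
        lowerValue = kvp.first;
        lowerFound = true;
      }
      if (cumulative > upperIndex)
      {
        // Assumption: integer division, as in the question's examples.
        total += (lowerValue + kvp.first) / 2;
        break;
      }
    }
  }
  return total;
}
```

Calling it with the values 1, 2, 3, 1, 1 follows the five steps of the example in the question.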

I would focus less on micro-optimizing and more on reducing the complexity from O(n * log n) to O(n).

Your algorithm is O(n * log n) because you do n insertions, each costing amortized O(log n) time.

There is a well-known O(n) algorithm for median finding. I suggest using this.

Usually a factor of log n is not a big deal, but for 20 million elements log2(n) is about 24, so removing it can make your algorithm roughly 25 times faster.
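
One readily available selection routine in C++ is std::nth_element, which runs in linear time on average. Whether it is the algorithm this answer links to is an assumption. The helper name medianBySelection is hypothetical. The sketch finds the median of one complete vector; calling it again after every insertion would cost O(n) each time, so on its own it does not solve the running-median problem in the question.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of a non-empty vector via selection.
// The vector is taken by value because std::nth_element reorders it.
int medianBySelection(std::vector<int> values)
{
  const std::size_t upper = values.size() / 2;

  // Afterwards values[upper] holds the element that would be there if the
  // vector were sorted, and nothing to its left is greater than it.
  std::nth_element(values.begin(), values.begin() + upper, values.end());
  const int upperValue = values[upper];

  if (values.size() % 2 == 1)
    return upperValue;

  // Even count: the lower middle is the largest element left of position upper.
  const int lowerValue = *std::max_element(values.begin(), values.begin() + upper);
  // Assumption: integer division, as in the question's examples.
  return static_cast<int>((static_cast<long long>(lowerValue) + upperValue) / 2);
}
```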

Oh, my bad. I didn't realize there are only 3-4 different integers...
