Efficient way to find the a-th to b-th smallest elements across k arrays

I recently had an interview with a social media company, where I was asked the following question.

There are k unsorted arrays of numbers, each of length m. The goal is to find the a-th to b-th smallest elements across all k arrays in a time-efficient and memory-conservative way, given a < b < m. In the follow-up question, the "unsorted arrays" are replaced by columns in different tables of a MySQL database: what efficient data structure could be used, and what are the corresponding retrieval algorithms?

Two possible solutions I came up with:

First: brute force:

  1. First find the b-th smallest element of each array using quickselect.
  2. Then collect the elements of each array that are no larger than its b-th smallest element and store them in a B-tree C of size k * b.
  3. Then find the a-th to b-th smallest elements in C.

For the first step, finding the b-th smallest element with quickselect, the average time is from O(km) to O(km * log(m)) in total. Step 2 takes O(km). The last step, finding the elements between the a-th and b-th smallest in C, takes O((b-a) * log(kb)). So the total is O(km) to O(km * log(m)) + O((b-a) * log(kb)) in time, and O(kb) in space.
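The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `heapq.nsmallest` stands in for "quickselect plus a filter pass", a sorted Python list stands in for the B-tree C, and the function name `range_via_per_array_prefix` is made up for this example.

```python
import heapq

def range_via_per_array_prefix(arrays, a, b):
    """Return the a-th..b-th smallest values (1-indexed, inclusive)."""
    candidates = []
    for arr in arrays:
        # Stand-in for steps 1 and 2: keep only the b smallest values of
        # this array. Any value outside this prefix cannot be among the
        # b smallest overall.
        candidates.extend(heapq.nsmallest(b, arr))
    # Stand-in for the ordered structure C (at most k * b values).
    candidates.sort()
    # Step 3: read off ranks a..b.
    return candidates[a - 1:b]
```

For example, with `[[9, 1, 5, 3], [8, 2, 7, 4], [6, 0, 11, 10]]`, a = 2 and b = 3, the candidate pool holds 9 of the 12 values and the function returns the second and third smallest values overall.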

Second: repeatedly popping out the smallest elements

In each iteration:

  1. Find the smallest element of each of the k arrays and store it in a B-tree C.
  2. Find the smallest element in C, and pop it both from C and from the array it came from.
  3. Repeat until a-1 numbers have been popped, then go to step 4.
  4. Store the a-th to b-th values while repeating steps 1 and 2.

So the computational complexity is O(k * log(k)) + O(b * log(k)), with space complexity O(max(k, b-a)). This seems to be the minimal space complexity.
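A minimal sketch of this second approach, assuming a binary heap from Python's `heapq` in place of the B-tree C (the function name `range_via_repeated_min` is hypothetical). On truly unsorted arrays each refill is a linear scan, so this sketch costs roughly O(km + bm) time rather than the O(b * log(k)) quoted above, which seems to assume each array's minimum can be found cheaply, for instance through an index.

```python
import heapq

def range_via_repeated_min(arrays, a, b):
    """Return the a-th..b-th smallest values (1-indexed, inclusive)."""
    # Work on copies so the caller's lists are untouched. The version
    # described above mutates the inputs and avoids this O(km) copy.
    pools = [list(arr) for arr in arrays]
    frontier = []  # plays the role of C: one (value, array index) per array
    for idx, pool in enumerate(pools):
        if pool:
            smallest = min(pool)      # linear scan: the array is unsorted
            pool.remove(smallest)
            frontier.append((smallest, idx))
    heapq.heapify(frontier)

    result = []
    for rank in range(1, b + 1):
        value, idx = heapq.heappop(frontier)
        if rank >= a:                 # only ranks a..b are stored
            result.append(value)
        pool = pools[idx]
        if pool:                      # refill from the array just consumed
            nxt = min(pool)
            pool.remove(nxt)
            heapq.heappush(frontier, (nxt, idx))
    return result
```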

What are more efficient ways to do this? In particular, the worst case of quickselect is O(n^2), which seems too large, and for b = m/2, right at the median, O(kb) in space or O(b * log(k)) in time was considered too large. For the MySQL database, I suggested using a B-tree, which gives fast rank selection in solution 1, although that still costs O(kb) in both space and time, with k queries to the database. For solution 2, I was told that b queries to the MySQL database is too many, and that B-tree insertion is O(log(m)), where m can be very large.

One easy way is to create a max-heap of size b. Then run this code:

for arr in arrays // process each of the k arrays in turn
    for i = 0 to length(arr)-1
        if heap.count < b
            // heap not yet full: keep every item
            heap.push(arr[i])
        else if (arr[i] < heap.peek())
            // smaller than the largest kept item: swap it in
            heap.pop()
            heap.push(arr[i])

The idea here is that you fill a max-heap with the first b items. Then, for every other item, if it's smaller than the largest item on the heap, you replace the largest item on the heap with the new item.

When you've processed all km items, the b smallest items are on the heap. Since it's a max-heap, items come off in descending order, so the first b - a + 1 items you pop are the b-th down through the a-th smallest items across all k arrays.

// all items have been processed; take the first b - a + 1 items from the max-heap
// (largest first, so result[0] is the b-th smallest and result[b-a] is the a-th)
for i = 0 to (b-a)
   result[i] = heap.pop()

Worst case is O(km log b) for the first loop, and O(b log b) for the second loop, using O(b) additional memory.
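For reference, here is roughly how the bounded max-heap idea might look in Python. `heapq` only provides a min-heap, so this sketch stores negated values, which assumes the items are numbers. The function name `range_via_bounded_max_heap` is made up for this example. It pops b - a + 1 values because the range a..b is inclusive, then reverses them into ascending order.

```python
import heapq

def range_via_bounded_max_heap(arrays, a, b):
    """Return the a-th..b-th smallest values (1-indexed, inclusive)."""
    heap = []  # holds negated values, so -heap[0] is the largest kept value
    for arr in arrays:
        for x in arr:
            if len(heap) < b:
                heapq.heappush(heap, -x)
            elif x < -heap[0]:
                # x beats the current largest of the b kept values:
                # pop that one and push x in a single operation.
                heapq.heapreplace(heap, -x)
    # Values come off largest first: b-th, (b-1)-th, ..., a-th.
    picked = [-heapq.heappop(heap) for _ in range(b - a + 1)]
    picked.reverse()
    return picked
```

The heap never holds more than b values, which matches the O(b) additional memory stated above, and the source arrays are only read.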

If you're allowed to destroy the source arrays, you could write a custom quickselect that indexes the k arrays as a single array. That would be O(km) on average, using O(k) extra memory for an indirect index. The downsides are that the indexing code would be somewhat slower and, of course, that items would move among arrays. You'd also probably want O(b) additional memory for the return value. Asymptotically it's more efficient than my original suggestion. Whether it would actually run faster is another question entirely.
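A rough sketch of such a quickselect over a "virtual" concatenated array follows. It assumes every array has the same length m, as in the question, so `divmod` can map a global position to an (array, offset) pair. With unequal lengths you would need the O(k) offset table mentioned above instead. The function name, the Lomuto partition and the random pivot are choices made for this illustration, and the inputs are rearranged in place.

```python
import random

def range_via_virtual_quickselect(arrays, a, b):
    """Return the a-th..b-th smallest values; rearranges `arrays` in place."""
    m = len(arrays[0])            # assumption: all arrays have length m
    n = m * len(arrays)

    def get(i):
        q, r = divmod(i, m)       # global position -> (array, offset)
        return arrays[q][r]

    def swap(i, j):
        qi, ri = divmod(i, m)
        qj, rj = divmod(j, m)
        arrays[qi][ri], arrays[qj][rj] = arrays[qj][rj], arrays[qi][ri]

    def select(lo, hi, target):
        # Rearrange positions lo..hi so that position `target` holds the
        # value it would hold if that range were sorted.
        while lo < hi:
            swap(random.randint(lo, hi), hi)   # move a random pivot to hi
            pivot = get(hi)
            store = lo
            for i in range(lo, hi):            # Lomuto partition
                if get(i) < pivot:
                    swap(i, store)
                    store += 1
            swap(store, hi)
            if target == store:
                return
            if target < store:
                hi = store - 1
            else:
                lo = store + 1

    select(0, n - 1, b - 1)   # positions 0..b-1 now hold the b smallest
    select(0, b - 1, a - 1)   # positions a-1..b-1 now hold ranks a..b
    return sorted(get(i) for i in range(a - 1, b))
```

The final sort only touches b - a + 1 values. The expected running time is linear in km, but like any quickselect it can degrade to quadratic time on unlucky pivots or heavily duplicated values.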

One other possibility. Run the build-heap method on each of the k arrays. That'd be O(km). Then do a merge to select the first b items. The merge would require:

  • O(log m) to remove each item from the source arrays
  • O(log b) to add each item to the merge heap
  • O(log b) to remove each item from the merge heap

The second step would be O(b * (log m + log b + log b)).

That gives a total of O(km + b * (log m + log b + log b)), and you'd use O(b) extra memory. Whether that would be faster than the original suggestion is questionable. It depends on the relationship between b and m. The larger the value of b, the less likely this is to be faster. And the code is a lot more complex to write.
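The build-heap-then-merge idea could be sketched like this, using `heapq.heapify` as the build-heap step. The function name `range_via_heapified_merge` is hypothetical. In this particular sketch the merge heap holds one entry per source array, so it never grows beyond k entries, and only ranks a..b are stored in the result. The source arrays are heapified and consumed in place.

```python
import heapq

def range_via_heapified_merge(arrays, a, b):
    """Return the a-th..b-th smallest values; mutates `arrays` in place."""
    for arr in arrays:
        heapq.heapify(arr)        # build-heap, O(m) per array
    # Merge heap: the current minimum of each non-empty array.
    merge = [(arr[0], idx) for idx, arr in enumerate(arrays) if arr]
    heapq.heapify(merge)

    result = []
    for rank in range(1, b + 1):
        value, idx = heapq.heappop(merge)
        if rank >= a:
            result.append(value)
        src = arrays[idx]
        heapq.heappop(src)        # remove the consumed value, O(log m)
        if src:
            # Expose that array's new minimum to the merge heap.
            heapq.heappush(merge, (src[0], idx))
    return result
```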
