
Efficient way to find the a-th to b-th smallest elements across k arrays

I recently had an interview with a social media company, where I was asked the following question.

There are k unsorted arrays of numbers, each of length m . The goal is to find the a-th to b-th smallest elements across the k arrays in an efficient and memory-conservative way, given a < b < m . In the follow-up question, the "unsorted arrays" are changed to columns across different tables in a MySQL database: what efficient data structures could be used, and what are the corresponding retrieval algorithms?

Two possible solutions I came up with:

First: brute-force:

  1. First find the b-th smallest element of each array using quickselect.
  2. Then collect the elements no larger than each array's b-th smallest element and store them in a B-tree C of size k * b .
  3. Then find the a-th to b-th smallest elements in C .

For step 1, finding the b-th smallest element of each array using quickselect takes from O(km) to O(km * log(m)) time in total on average. Step 2's time complexity is O(km) . The last step, finding the elements ranked between a and b in C , takes O((b-a) * log(kb)) . So in total this requires from O(km) to O(km * log(m)) , plus O((b-a) * log(kb)) , in time, and O(kb) in space.
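A minimal Python sketch of this brute-force approach (my illustration): heapq.nsmallest stands in for the per-array quickselect step, and a plain sorted list stands in for the B-tree C . Ranks are taken as 1-based and inclusive, which is my assumption.

import heapq

def select_range_bruteforce(arrays, a, b):
    # arrays: list of k unsorted lists; a, b: 1-based ranks, a < b.
    # Returns the a-th through b-th smallest values across all arrays.
    pool = []
    for arr in arrays:
        # Stand-in for "quickselect the b-th smallest, then keep the
        # elements up to it": nsmallest is O(m * log(b)) per array.
        pool.extend(heapq.nsmallest(b, arr))
    pool.sort()               # a sorted list plays the role of the B-tree C
    return pool[a - 1:b]      # ranks a..b, inclusive

print(select_range_bruteforce([[9, 1, 7], [4, 8, 2], [6, 3, 5]], 2, 4))
# -> [2, 3, 4]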

Second: repeatedly popping out the smallest elements

In each iteration:

  1. Find the smallest element of each of the k arrays and store these k values in a B-tree C
  2. Find the smallest element in C , pop it from C , and pop it from the array it came from
  3. Repeat until a-1 numbers have been popped, then go to 4
  4. Keep repeating steps 1 to 2, now storing the popped values, until the b-th smallest has been popped

So the computational complexity is O(k * log(k)) + O(b * log(k)) , with space complexity O(max(k, b-a)) . (This ignores the cost of finding an unsorted array's next-smallest element after each pop, which is O(m) per pop unless each array is heapified first.) This seems to be the minimal space complexity.
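A minimal Python sketch of this loop (my illustration): a heap of (value, array index) pairs stands in for the B-tree C , and the O(m) re-scan per pop is done with min().

import heapq

def select_range_popping(arrays, a, b):
    # arrays: list of k unsorted lists (consumed in place);
    # a, b: 1-based ranks, a < b.
    c = [(min(arr), i) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(c)                      # step 1: each array's minimum
    result = []
    for rank in range(1, b + 1):
        value, i = heapq.heappop(c)       # step 2: the global minimum
        if rank >= a:
            result.append(value)          # step 4: keep ranks a..b
        arrays[i].remove(value)           # pop it from its source array
        if arrays[i]:                     # refill C from that array only
            heapq.heappush(c, (min(arrays[i]), i))
    return result

print(select_range_popping([[9, 1, 7], [4, 8, 2], [6, 3, 5]], 2, 4))
# -> [2, 3, 4]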

What are more efficient ways to do this? In particular, the worst case of quickselect is O(n^2) , which seems too big, and for b = m/2 (right at the median) the O(kb) space of solution 1 and the O(b * log(k)) time of solution 2 were both considered too big. For the MySQL database, I suggested using a B-tree index, which gives fast rank selection in solution 1, but that still costs O(kb) in both space and time, with k queries into the database. For solution 2, I was told that b queries into the MySQL DB are too many, and that B-tree insertion is O(log(m)) where m can be very large.

One easy way is to create a max-heap of size b . Then run this code:

for arr in arrays // process each of the k arrays in turn
    for i = 0 to length(arr)-1
        if heap.count < b
            heap.push(arr[i])
        else if (arr[i] < heap.peek())
            heap.pop()
            heap.push(arr[i])

The idea here is that you fill a max-heap with the first b items. Then, for every other item, if it's smaller than the largest item on the heap, you replace the largest item on the heap with the new item.

When you've processed all km items, the smallest b items are on the heap, and since it's a max-heap, the first b-a+1 items you pop will be the b-th down through the a-th smallest items across all k arrays (they come out in descending order).

// all items have been processed; pop the b-th down to the a-th smallest
for i = 0 to (b-a)
   result[i] = heap.pop()

Worst case is O(km log b) for the first loop, and O((b-a) log b) for the second loop, using O(b) additional memory.
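A runnable Python version of this idea (my sketch; Python's heapq is a min-heap, so values are negated to simulate the max-heap):

import heapq

def select_range_maxheap(arrays, a, b):
    # arrays: iterable of k unsorted lists; a, b: 1-based ranks, a < b.
    heap = []                                # holds -value for the b smallest seen
    for arr in arrays:                       # process each of the k arrays in turn
        for x in arr:
            if len(heap) < b:
                heapq.heappush(heap, -x)
            elif x < -heap[0]:               # smaller than the current b-th smallest
                heapq.heapreplace(heap, -x)  # pop-then-push in one O(log b) step
    # Pop b-a+1 items: the b-th smallest first, down to the a-th.
    out = [-heapq.heappop(heap) for _ in range(b - a + 1)]
    return out[::-1]                         # ascending order: a-th .. b-th

print(select_range_maxheap([[9, 1, 7], [4, 8, 2], [6, 3, 5]], 2, 4))
# -> [2, 3, 4]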

If you're allowed to destroy the source arrays, you could write a custom quickselect that indexes the k arrays as if they were a single array. That would be O(km), using O(k) extra memory for an indirect index. The downside is that the indexing code would be somewhat slower. And, of course, items would move among arrays. You'd probably also want O(b) additional memory for the return value. Asymptotically it's more efficient than my original suggestion. Whether it would run faster is another question entirely.
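A sketch of that flat-index idea (my illustration, assuming all arrays have the same length m): wrap the k arrays behind get/put helpers so an ordinary in-place quickselect can treat them as a single array of length km.

import random

def select_range_flat_quickselect(arrays, a, b):
    # arrays: list of k equal-length lists, modified in place;
    # a, b: 1-based ranks, a < b.
    m = len(arrays[0])
    n = len(arrays) * m

    def get(i):                       # flat index -> element
        return arrays[i // m][i % m]

    def put(i, x):
        arrays[i // m][i % m] = x

    def swap(i, j):
        x, y = get(i), get(j)
        put(i, y)
        put(j, x)

    def partition(lo, hi):            # Lomuto partition, random pivot
        swap(random.randint(lo, hi), hi)
        pivot, store = get(hi), lo
        for i in range(lo, hi):
            if get(i) < pivot:
                swap(i, store)
                store += 1
        swap(store, hi)
        return store

    def quickselect(lo, hi, rank):    # put the rank-th (0-based) item in place
        while lo < hi:
            p = partition(lo, hi)
            if p == rank:
                return
            if p < rank:
                lo = p + 1
            else:
                hi = p - 1

    quickselect(0, n - 1, b - 1)      # first b flat slots now hold the b smallest
    smallest_b = sorted(get(i) for i in range(b))   # O(b log b) to order them
    return smallest_b[a - 1:b]

arrs = [[9, 1, 7], [4, 8, 2], [6, 3, 5]]
print(select_range_flat_quickselect(arrs, 2, 4))    # -> [2, 3, 4]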

One other possibility: run the build-heap method on each of the k arrays. That'd be O(km) in total. Then do a heap-based k-way merge to select the first b items. The merge would require:

  • O(log m) to remove each item from its source heap
  • O(log k) to add each item to the merge heap, which holds one root per array
  • O(log k) to remove each item from the merge heap

The second step would be O(b * (log m + log k)).

That gives a total of O(km + b * (log m + log k)), and you'd use O(k) extra memory for the merge heap plus O(b) for the result. Whether that would be faster than the original suggestion is questionable. It depends on the relationship between b and m . The larger the value of b , the less likely this is to be faster. And the code is a lot more complex to write.
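A Python sketch of this heapify-then-merge approach (my illustration, with heapq serving as both the per-array heap and the merge heap):

import heapq

def select_range_merge(arrays, a, b):
    # arrays: list of k unsorted lists, heapified in place;
    # a, b: 1-based ranks, a < b.
    for arr in arrays:
        heapq.heapify(arr)                 # build-heap: O(m) per array

    merge = [(arr[0], i) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(merge)                   # one (root, array index) per array

    result = []
    for rank in range(1, b + 1):           # b rounds of the merge
        value, i = heapq.heappop(merge)    # O(log k)
        if rank >= a:
            result.append(value)
        heapq.heappop(arrays[i])           # O(log m): drop it from its source heap
        if arrays[i]:
            heapq.heappush(merge, (arrays[i][0], i))   # O(log k)
    return result

print(select_range_merge([[9, 1, 7], [4, 8, 2], [6, 3, 5]], 2, 4))
# -> [2, 3, 4]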
