I recently had an interview with a social media company, where I was asked the following question.
There are k unsorted arrays of numbers, each of length m. The goal is to find the a-th through b-th smallest elements across the k arrays in a time- and memory-efficient way, given a < b < m. In the follow-up question, the "unsorted arrays" become columns across different tables in a MySQL database: what efficient data structure could be used, and what are the corresponding retrieval algorithms?
Two possible solutions I came up with:
First: brute-force:
The first step is to find the b-th smallest element using quickselect, which takes from O(km) on average up to O(km * log(m)) in total. Step 2's time complexity is O(km). The last step is to find the elements between the a-th and b-th smallest in C, taking O((b - a) * log(kb)). So the total time is O(km) to O(km * log(m)) plus O((b - a) * log(kb)), with O(kb) space.
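A minimal Python sketch of this brute-force pipeline, assuming ranks are 1-indexed. heapq.nsmallest stands in for a hand-rolled quickselect, so this particular version runs in O(km * log(b)) rather than the average-case O(km):

```python
import heapq
from itertools import chain

def brute_force_range_select(arrays, a, b):
    """Return the a-th through b-th smallest (1-indexed) across all arrays.

    Sketch of the brute-force solution above; heapq.nsmallest collects
    the b smallest items, then the a-th..b-th are the tail of that list.
    """
    # Steps 1+2: gather the b smallest items across all k arrays.
    smallest_b = heapq.nsmallest(b, chain.from_iterable(arrays))
    # Step 3: nsmallest returns them sorted, so just slice out ranks a..b.
    return smallest_b[a - 1 : b]

print(brute_force_range_select([[5, 1, 9], [4, 7, 2], [8, 3, 6]], 2, 4))
# -> [2, 3, 4]
```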
Second: recursively popping out the smallest elements
For each loop, do
So the computational complexity is O(k * log(k)) + O(b * log(k)), with space complexity O(max(k, b - a)). This seems to be the minimal space complexity.
What are more efficient ways to do this? In particular, the worst case of quickselect is O(n^2), which seems too big, and for b = m/2 (right at the median) either O(kb) in space or O(b * log(k)) in time was considered too big. For the MySQL database, I suggested using a B-tree, which gives fast rank select in solution 1, but it still costs O(kb) in both space and time, with k queries into the database. For solution 2, I was told that b queries into the MySQL DB is too many, and that B-tree insertion is O(log(m)), where m can be very large.
One easy way is to create a max-heap of size b. Then run this code:
for arr in arrays // process each of the k arrays in turn
for i = 0 to length(arr)-1
if heap.count < b
heap.push(arr[i])
else if (arr[i] < heap.peek())
heap.pop()
heap.push(arr[i])
The idea here is that you fill a max-heap with the first b items. Then, for every other item, if it's smaller than the largest item on the heap, you replace the largest item on the heap with the new item.
When you've processed all km items, the smallest b items are on the heap. Since it's a max-heap, the first b - a + 1 items you pop are the b-th down through the a-th smallest items in all k arrays, in descending order.
// all items have been processed; pop the b - a + 1 items ranked a through b (largest first)
for i = 0 to (b-a)
    result[i] = heap.pop()
Worst case is O(km log b) for the first loop, and O((b - a) log b) for the second loop, using O(b) additional memory.
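The pseudocode above can be made runnable in Python. heapq only provides a min-heap, so this sketch stores negated values to simulate the max-heap; it pops b - a + 1 items so that the a-th smallest is included, and the pops come off largest-first:

```python
import heapq

def heap_range_select(arrays, a, b):
    """Max-heap of size b over all km items, simulated by negating values.

    Returns ranks b down through a (1-indexed), in descending order,
    matching the pop order of a max-heap holding the b smallest items.
    """
    heap = []  # negated values; -heap[0] is the largest of the kept items
    for arr in arrays:
        for x in arr:
            if len(heap) < b:
                heapq.heappush(heap, -x)
            elif x < -heap[0]:                 # smaller than current b-th smallest
                heapq.heapreplace(heap, -x)    # pop largest, push new item
    # Pop b - a + 1 items; they come off largest-first (ranks b, b-1, ..., a).
    return [-heapq.heappop(heap) for _ in range(b - a + 1)]

print(heap_range_select([[5, 1, 9], [4, 7, 2], [8, 3, 6]], 2, 4))
# -> [4, 3, 2]
```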
If you're allowed to destroy the source arrays, you could write a custom quickselect that indexes the k arrays as a single array. That would be O(km), using O(k) extra memory for an indirect index. The downsides are that the indexing code would be somewhat slower and, of course, that items would move among arrays. And you'd probably want O(b) additional memory for the return value. Asymptotically it's more efficient than my original suggestion; whether it would run faster is another question entirely.
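A sketch of that virtual-array quickselect, under the assumptions that the arrays all have equal length m and that pivots are chosen at random (so the O(km) bound holds only in expectation). Since each array is a length-m row, the flat index i maps to arrays[i // m][i % m], so no explicit indirect index is needed here; the partition is Hoare-style and it destroys the source order, as noted above:

```python
import random

def virtual_quickselect(arrays, b):
    """Quickselect over k equal-length arrays treated as one array of km items.

    Partitions in place (destroying the source order) until the first b
    virtual positions hold the b smallest items, then returns them sorted.
    """
    m = len(arrays[0])
    n = len(arrays) * m

    def get(i):
        return arrays[i // m][i % m]

    def swap(i, j):
        a1, o1, a2, o2 = arrays[i // m], i % m, arrays[j // m], j % m
        a1[o1], a2[o2] = a2[o2], a1[o1]

    lo, hi = 0, n - 1
    while lo < hi:
        pivot = get(random.randint(lo, hi))   # random pivot value
        i, j = lo, hi
        while i <= j:                         # Hoare-style partition
            while get(i) < pivot:
                i += 1
            while get(j) > pivot:
                j -= 1
            if i <= j:
                swap(i, j)
                i += 1
                j -= 1
        if b - 1 <= j:        # target rank is in the left part
            hi = j
        elif b - 1 >= i:      # target rank is in the right part
            lo = i
        else:                 # target sits in the all-equal-to-pivot gap
            break
    return sorted(get(i) for i in range(b))   # the b smallest, sorted

print(virtual_quickselect([[5, 1, 9], [4, 7, 2], [8, 3, 6]], 4))
# -> [1, 2, 3, 4]
```

Slicing the result with [a - 1:] then yields the requested a-th through b-th range.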
One other possibility: run the build-heap method on each of the k arrays, which is O(km), then do a merge to select the first b items.
The second step would be O(b * (log m + log b + log b)).
That gives a total of O(km + b * (log m + log b + log b)), and you'd use O(b) extra memory. Whether that would be faster than the original suggestion is questionable; it depends on the relationship between b and m. The larger the value of b, the less likely this is to be faster. And the code is a lot more complex to write.
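A sketch of this heapify-then-merge approach. The merge details are an assumption on my part: min-heapify each array (working on copies here so the inputs survive), keep a small heap of the k current array roots, and pop the global minimum b times, re-heapifying the source array after each pop:

```python
import heapq

def heapify_and_merge(arrays, b):
    """Build-heap on each of the k arrays, then merge out the b smallest.

    Each of the b merge steps costs O(log k) on the heap of roots plus
    O(log m) to re-heapify the source array the minimum came from.
    """
    heaps = [list(arr) for arr in arrays]     # copies; heapify is destructive
    for h in heaps:
        heapq.heapify(h)                      # build-heap: O(m) each, O(km) total
    # k-sized heap of (current root value, source index): O(k) to build.
    roots = [(h[0], i) for i, h in enumerate(heaps) if h]
    heapq.heapify(roots)
    out = []
    while len(out) < b and roots:
        value, i = heapq.heappop(roots)       # smallest current root: O(log k)
        out.append(value)
        heapq.heappop(heaps[i])               # re-heapify its source: O(log m)
        if heaps[i]:
            heapq.heappush(roots, (heaps[i][0], i))
    return out

print(heapify_and_merge([[5, 1, 9], [4, 7, 2], [8, 3, 6]], 4))
# -> [1, 2, 3, 4]
```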