I have a large array of size n (say n = 1000000) with values monotonically non-decreasing. I have a set of 'k' key values (say k = { 1,23,39,55,..}). Assume the key values are sorted. I have to find the index of each key value in the large array using the minimum number of comparisons. How do I use binary search to search for multiple unique values? Doing it separately for each key value takes a lot of comparisons. Can I reuse some knowledge I learned in one search when I search for another element in the same big array?
{0, len(haystack)}. These pairs represent all the knowledge we have of the possible locations of the needles. There may be some slight complication here when you have duplicate values in the haystack, but I think once you have the rest sorted out this should not be too difficult.
I was curious if NumPy implemented anything like this. The Python name for what you're doing is numpy.searchsorted(), and once you get through the API layers it comes to this:
/*
 * Updating only one of the indices based on the previous key
 * gives the search a big boost when keys are sorted, but slightly
 * slows down things for purely random ones.
 */
if (@TYPE@_LT(last_key_val, key_val)) {
    max_idx = arr_len;
}
else {
    min_idx = 0;
    max_idx = (max_idx < arr_len) ? (max_idx + 1) : arr_len;
}
So they do not do a full-blown optimization like I described, but they do track when the current needle is greater than the last needle, in which case they can avoid searching the haystack below where the last needle was found. This is a simple and elegant improvement over the naive implementation, and as seen from the comments, it must be kept simple and fast because the function does not require the needles to be sorted in the first place.
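A minimal Python sketch of the same idea, assuming the needles are already sorted in ascending order (the NumPy code does not require that). It uses the standard-library bisect module rather than NumPy internals, and the function name search_sorted_keys is made up for illustration. Each search starts at the index where the previous needle landed:

```python
from bisect import bisect_left


def search_sorted_keys(haystack, needles):
    """Return the leftmost insertion index of each needle in haystack.

    Assumes both haystack and needles are sorted ascending.
    """
    results = []
    lo = 0  # nothing below this index can match the current needle
    for key in needles:
        # Everything before `lo` is smaller than the previous needle,
        # hence smaller than this one too, so the search can skip it.
        lo = bisect_left(haystack, key, lo)
        results.append(lo)
    return results
```

For a needle that is not in the haystack, the returned index is the position where it would be inserted.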
By the way: my proposed solution aimed for something like theoretical optimality in big-O terms, but if you have a large number of needles, the fastest thing to do is probably to sort the needles then iterate over the entire haystack and all the needles in tandem: linear-search for the first needle, then resume from there to look for the second, etc. You can even skip every second item in the haystack by recognizing that if a needle is greater than A and less than C, it must belong at position B (assuming you don't care about the left/right insertion order for needles not in the haystack). You can then do about len(haystack)/2 comparisons and the entire thing will be very cache-friendly (after sorting the needles, of course).
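The tandem scan could look roughly like the following Python sketch. It assumes sorted needles and returns leftmost insertion indexes; the name merge_scan is hypothetical, and the "skip every second item" refinement is left out to keep the sketch short:

```python
def merge_scan(haystack, needles):
    """Walk haystack and sorted needles together, like the merge step of merge sort."""
    results = []
    i = 0  # current position in the haystack; it never moves backwards
    for key in needles:
        # Advance until the first haystack value that is >= key.
        while i < len(haystack) and haystack[i] < key:
            i += 1
        results.append(i)
    return results
```

The total work is at most len(haystack) + len(needles) steps, so this only pays off when the number of needles is a sizeable fraction of the haystack size.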
One way to reuse knowledge from previous steps is, as others suggested, to restrict the search ranges for the smaller and larger keys once you have located a key.
Assuming N=2^n, K=2^k and lucky outcomes: after finding the middle key (n comparisons), you have two subarrays of size N/2. Perform 2 searches for the "quartile" keys (n-1 comparisons each), reducing to N/4 subarrays, and so on.
In total, n + 2(n-1) + 4(n-2) + ... + 2^(k-1)(n-k+1) comparisons. After a bit of math, this equals roughly Kn - Kk = K(n-k).
This is a best case scenario and the savings are not so significant compared to independent searches (Kn comparisons). Anyway, the worst case (all searches resulting in imbalanced partitions) is not worse than independent searches.
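One way to sketch this "middle key first" scheme in Python, assuming sorted keys and leftmost insertion indexes as the result. The names multi_search and solve are illustrative only, and bisect_left stands in for a hand-written binary search:

```python
from bisect import bisect_left


def multi_search(haystack, needles):
    """Locate every sorted needle, searching the median needle first."""
    results = [0] * len(needles)

    def solve(k_lo, k_hi, h_lo, h_hi):
        # Invariant: needles[k_lo:k_hi] all belong somewhere in [h_lo, h_hi].
        if k_lo >= k_hi:
            return
        mid = (k_lo + k_hi) // 2
        pos = bisect_left(haystack, needles[mid], h_lo, h_hi)
        results[mid] = pos
        solve(k_lo, mid, h_lo, pos)      # smaller keys cannot land above pos
        solve(mid + 1, k_hi, pos, h_hi)  # larger keys cannot land below pos

    solve(0, len(needles), 0, len(haystack))
    return results
```

The recursion depth is about lg(K), so it is not a concern for realistic key counts.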
UPDATE : this is an instance of the Minimum Comparison Merging problem
Finding the locations of the K keys in the array of N values is the same as merging the two sorted sequences.
From Knuth Vol. 3, Section 5.3.2, we know that at least ceiling(lg(C(N+K,K))) comparisons are required (because there are C(N+K,K) ways to intersperse the keys in the array). When K is much smaller than N, this is close to lg(N^K/K!), or K lg(N) - K lg(K) = K(n-k).
This bound cannot be beaten by any comparison-based method, so any such algorithm needs on the order of K lg(N/K) comparisons, which grows essentially in proportion to the number of keys.
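To get a feel for the numbers, the bound can be evaluated exactly with Python's big integers. This is only a sketch: the function names are made up, math.comb needs Python 3.8 or later, and the cost of independent searches is taken as ceiling(lg(N+1)) comparisons per key, which is the worst case for a search with N+1 possible outcomes:

```python
import math


def merge_lower_bound(n, k):
    """ceiling(lg(C(n + k, k))): comparisons any comparison-based method needs in the worst case."""
    ways = math.comb(n + k, k)
    # For an integer ways >= 1, (ways - 1).bit_length() equals ceil(log2(ways)).
    return (ways - 1).bit_length()


def independent_cost(n, k):
    """Worst-case comparisons for k separate binary searches over n values."""
    return k * math.ceil(math.log2(n + 1))
```

For example, merge_lower_bound(4, 1) is 3, the same as a single binary search over four values. For larger K the gap between the two functions shows how much room there is for a smarter algorithm.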
While not optimal it is much easier to implement.
If you have an array of ints and want to minimize the number of comparisons, I suggest interpolation search (Knuth, Section 6.2.1). Where binary search requires about log(N) iterations (and comparisons), interpolation search requires only about log(log(N)) on average, provided the values are roughly uniformly distributed; in the worst case it can degrade to N.
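A minimal Python sketch of interpolation search for a single key, assuming a sorted list of ints. This is a textbook-style version written for illustration, not code taken from Knuth, and it returns -1 when the key is absent:

```python
def interpolation_search(arr, x):
    """Return an index of x in the sorted int list arr, or -1 if x is absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= x <= arr[hi]:
        if arr[hi] == arr[lo]:
            # All values in the range are equal, and x lies between them.
            return lo
        # Guess the position by assuming values grow linearly from lo to hi.
        pos = lo + (x - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == x:
            return pos
        if arr[pos] < x:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

With duplicates it returns some matching index, not necessarily the first one.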
For details and code sample see:
I know the question was regarding C, but I just did an implementation of this in JavaScript that I thought I'd share. It is not intended to work if you have duplicate elements in the array; I think it will just return any of the possible indexes in that case. For an array with 1 million elements where you search for each element, it's about 2.5x faster. If you also search for elements that are not contained in the array, then it's even faster. In one data set I threw at it, it was several times faster. For small arrays it's about the same.
function singleSearch(array, num) {
    return singleSearch_(array, num, 0, array.length);
}

// Lower-bound binary search: first index in [left, right) whose value is >= num.
function singleSearch_(array, num, left, right) {
    while (left < right) {
        var middle = (left + right) >> 1;
        var midValue = array[middle];
        if (num > midValue) {
            left = middle + 1;
        } else {
            right = middle;
        }
    }
    return left;
}

// nums must be sorted ascending; returns an Int32Array of indexes into array.
function multiSearch(array, nums) {
    var numsLength = nums.length;
    var results = new Int32Array(numsLength);
    multiSearch_(array, nums, 0, array.length, 0, numsLength, results);
    return results;
}

function multiSearch_(array, nums, left, right, numsLeft, numsRight, results) {
    var middle = (left + right) >> 1;
    var midValue = array[middle];
    // Split the keys around the middle value of the current array range.
    var numsMiddle = singleSearch_(nums, midValue, numsLeft, numsRight);
    if ((numsRight - numsLeft) > 1) {
        if (middle + 1 < right) {
            var newLeft = middle;
            var newRight = middle;
            if ((numsRight - numsMiddle) > 0) {
                multiSearch_(array, nums, newLeft, right, numsMiddle, numsRight, results);
            }
            if (numsMiddle - numsLeft > 0) {
                multiSearch_(array, nums, left, newRight, numsLeft, numsMiddle, results);
            }
        }
        else {
            // Array range too small to split further: search each key directly.
            for (var i = numsLeft; i < numsRight; i++) {
                var result = singleSearch_(array, nums[i], left, right);
                results[i] = result;
            }
        }
    }
    else {
        var result = singleSearch_(array, nums[numsLeft], left, right);
        results[numsLeft] = result;
    }
}
// A recursive binary search based function. It returns the index of x
// if it is present in the given array arr[l..r], otherwise -1.
int binarySearch(int arr[], int l, int r, int x)
{
    if (r >= l)
    {
        int mid = l + (r - l)/2;
        // If the element is present at one of the middle 3 positions
        if (arr[mid] == x) return mid;
        if (mid > l && arr[mid-1] == x) return (mid - 1);
        if (mid < r && arr[mid+1] == x) return (mid + 1);
        // If element is smaller than mid, then it can only be present
        // in left subarray
        if (arr[mid] > x) return binarySearch(arr, l, mid-2, x);
        // Else the element can only be present in right subarray
        return binarySearch(arr, mid+2, r, x);
    }
    // We reach here when element is not present in array
    return -1;
}