Efficient algorithm to find number of elements less than a query
I have two unsorted arrays a and b. For every element a[i] I need to find the number of elements b[j] such that b[j] < a[i]. In addition, b may contain duplicates, which should not be counted. Both arrays may be very large.
I tried (for a single query x):
public static void main(String arg[]) {
    int x = 5;
    int b[] = {2, 4, 3, 4, 11, 13, 17};
    Arrays.sort(b);
    int count = 0;
    for (int i = 0; i < b.length; ++i) {
        if (b[i] < x) {
            if (i == 0)
                ++count;
            else {
                // this check avoids counting duplicates
                if (b[i - 1] != b[i])
                    ++count;
            }
        } else {
            break;
        }
    }
    System.out.println(count);
}
My problem is that this doesn't perform well enough when querying all elements of a iteratively. What can I do to speed this up?
EDIT: given the later comments, here are some updates, which I put right at the beginning; my first text stays at the bottom.
So, the core aspects here are:
OK, so about your problem. Thing is: actually, that is just "work". There is no magic there. As you have two very large arrays, working on unsorted data is an absolute no-go.
So, you start by sorting both arrays.
Then you iterate over the first array and, while doing that, you also look into the second array:
// both a and b are sorted at this point
int indexWithinB = 0;
int counterForCurrentA = 0; // and actually ALL values from a before
for (int i = 0; i < a.length; i++) {
    int currentA = a[i];
    // note: indexWithinB is not yet checked against b.length here
    while (b[indexWithinB] < currentA) {
        // count b[indexWithinB] unless it repeats its predecessor; index 0 has no predecessor
        if (indexWithinB == 0 || b[indexWithinB - 1] != b[indexWithinB]) {
            counterForCurrentA++;
        }
        indexWithinB++;
    }
    // while loop ended, this means: b[indexWithinB] >= currentA
    // this also means: counterForCurrentA should have the correct value for a[i]
}
The above is obviously pseudo code. It is meant to get you going, and it may very well contain subtle errors. For example, as Andreas pointed out: the above needs to be enhanced to check for b.length, too. But that is left as an exercise to the reader.
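One way the draft could be completed is sketched below. It is only a sketch, not the definitive version: it adds the check against b.length, and it sorts the positions of a instead of a itself so that every count ends up at the original index. The names SweepCount and countDistinctSmaller are made up for this example.

```java
import java.util.Arrays;

class SweepCount {
    // For every a[i], returns the number of distinct values in b that are < a[i].
    static int[] countDistinctSmaller(int[] a, int[] b) {
        int[] sortedB = b.clone();
        Arrays.sort(sortedB);

        // Sort the positions of a by value, so results can be written back in original order.
        Integer[] order = new Integer[a.length];
        for (int i = 0; i < a.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (p, q) -> Integer.compare(a[p], a[q]));

        int[] result = new int[a.length];
        int indexWithinB = 0;
        int counter = 0; // distinct values of b passed so far; carries over between queries
        for (int pos : order) {
            while (indexWithinB < sortedB.length && sortedB[indexWithinB] < a[pos]) {
                // count a value only when it starts a new run of equal values
                if (indexWithinB == 0 || sortedB[indexWithinB - 1] != sortedB[indexWithinB]) {
                    counter++;
                }
                indexWithinB++;
            }
            result[pos] = counter;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 4, 8, 17, 12, 22};
        int[] b = {2, 4, 3, 4, 11, 13, 17};
        // prints [3, 0, 2, 3, 5, 4, 6]
        System.out.println(Arrays.toString(countDistinctSmaller(a, b)));
    }
}
```

Because indexWithinB only ever moves forward, the sweep itself is linear in the combined length of both arrays; the two sorts dominate the running time.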
That is what I meant with "just work": you simply have to sit down, write test cases and refine my draft algorithm until it does the job for you. If you find it too hard to program this initially, then take a piece of paper, put down two arrays with numbers ... and do that counting manually.
Final hint: I suggest writing plenty of unit tests for your algorithm (such stuff is perfect for unit tests), and make sure that you cover all your corner cases in those tests. You want to be 100% sure that your algorithm is correct before going for your 10^5 element arrays!
And here, as promised, the original answer:
Simply put: iterating and counting is the most efficient way to solve this problem. So in your case above, leaving out the sorting might lead to a quicker overall execution time.
The logic there is really simple: in order to know the count of numbers smaller than x ... you have to look at all of them. Thus you have to iterate the complete array (when that array is not sorted).
Thus, given your initial statement, there is nothing else to do than: iterate and count.
Of course, if you have to do this counting multiple times ... it might be worth sorting that data initially. Because then you can use binary search, and getting the count you are looking for works without iterating all data.
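A minimal sketch of that idea, assuming b is sorted and stripped of duplicates once and then queried many times. The class and method names (BinarySearchCount, sortedDistinct, countSmaller) are invented for this example; Arrays.binarySearch and IntStream.distinct() are standard JDK calls.

```java
import java.util.Arrays;

class BinarySearchCount {
    // Sorts b and removes duplicates; done once up front, O(n log n).
    static int[] sortedDistinct(int[] b) {
        return Arrays.stream(b).distinct().sorted().toArray();
    }

    // Number of values in sortedDistinctB that are < x, in O(log n).
    static int countSmaller(int[] sortedDistinctB, int x) {
        int pos = Arrays.binarySearch(sortedDistinctB, x);
        // Found: the index equals the number of smaller values, since no duplicates are left.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -pos - 1;
    }

    public static void main(String[] args) {
        int[] prepared = sortedDistinct(new int[] {2, 4, 3, 4, 11, 13, 17});
        for (int x : new int[] {5, 1, 4, 8, 17, 12, 22}) {
            System.out.println(x + " - " + countSmaller(prepared, x));
        }
    }
}
```

With n elements in b and m queries, this costs roughly O(n log n) for the preparation plus O(m log n) for all lookups, instead of O(m * n) for scanning b once per query.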
And: what makes you think that iterating an array with 10^5 elements is a problem? In other words: are you just worried about a potential performance problem, or do you have a real performance problem? You see, at some point you probably had to create and fill that array. That for sure took more time (and resources) than a simple for-loop to count entries. And honestly: unless we are talking about some small embedded device ... 10^5 elements ... that is close to nothing, even when using slightly outdated hardware.
Finally: when you are worried about runtime, the simple answer is: slice your input data, and use 2, 4, 8, ... threads to count each slice in parallel! But as said: before writing that code, I would do some profiling to be sure that you really have to spend precious development time on this. Don't solve hypothetical performance problems; focus on those that really matter to you or your users!
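As a rough illustration of the slicing idea, the sketch below lets a parallel stream split the queries in a into slices and process them on several threads, instead of managing threads by hand. That is a swapped-in variation, not necessarily what was meant above, and it assumes b is sorted and de-duplicated once beforehand. ParallelCount, lowerBound and countAll are hypothetical names; whether this is actually faster depends on the array sizes, so profile first.

```java
import java.util.Arrays;

class ParallelCount {
    // Number of values in sortedDistinctB that are < x (lower-bound binary search).
    static int lowerBound(int[] sortedDistinctB, int x) {
        int low = 0;
        int high = sortedDistinctB.length;
        while (low < high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            if (sortedDistinctB[mid] < x) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }

    static int[] countAll(int[] a, int[] b) {
        int[] prepared = Arrays.stream(b).distinct().sorted().toArray();
        // parallel() lets the runtime split a into slices handled by several threads;
        // toArray() still returns the results in the original order of a.
        return Arrays.stream(a).parallel().map(x -> lowerBound(prepared, x)).toArray();
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 4, 8, 17, 12, 22};
        int[] b = {2, 4, 3, 4, 11, 13, 17};
        System.out.println(Arrays.toString(countAll(a, b)));
    }
}
```

The prepared array is only read after it has been built, so the worker threads can share it without any locking.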
Comparing every item in the array with x takes O(n) time. Sorting the array takes O(n log n); after that you can use binary search, which is O(log n), for a total of O(n log n). So for a single query the most efficient way is also the trivial one: just loop through the array and compare every item with x.
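public static void main(String arg[]) {
    int b[] = {2, 4, 3, 4, 11, 13, 17};
    int x = 5;
    int count = 0;
    // remembers the values already counted, so duplicates in b are counted only once
    java.util.Set<Integer> seen = new java.util.HashSet<>();
    for (int i = 0; i < b.length; i++) {
        // seen.add(...) returns false if the value was already in the set
        if (b[i] < x && seen.add(b[i])) {
            count++;
        }
    }
    System.out.println(count);
}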
I suggest you consider the following approach, but it works only if the b array has non-negative numbers. The algorithm works even if the input arrays (both a and b) are not sorted.
Pseudo-code
1. Get the max element of array b.
2. Create a new array c of size max + 1 and put 1 in the position c[b[i]] for every element b[i].
3. Create a new array d of size max + 1 and populate it as follows:
d[0] = c[0];
d[i] = d[i-1] + c[i];
4. Create a new array e of size n and populate it as follows:
if (a[i] <= 0) then e[i] = 0
else if (a[i] > max) then e[i] = last(d)
otherwise e[i] = d[a[i]-1];
The e array represents the solution: its i-th position contains the count of numbers in the b array that are lower than the i-th element of array a. This example should be more explanatory than the pseudo-code:
a = [5, 1, 4, 8, 17, 12, 22]
b = [2, 4, 3, 4, 11, 13, 17]
c = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1]
d = [0, 0, 1, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6]
e = [3, 0, 2, 3, 5, 4, 6]
Complexity
Steps 1, 2 and 4 are O(n). Step 3 is O(max(b)). If the input array b contains only "short" numbers (max(b) is of the same order as n), the algorithm runs in O(n). The algorithm could be optimized by creating arrays of size max-min+1 and using a count of 0 for all elements of the a array lower than min(b).
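The max-min+1 optimization could look roughly like the sketch below. Shifting every value by min also removes the restriction to non-negative numbers. It assumes that max - min is small enough to allocate an array of that size; OffsetCount and countDistinctSmaller are names invented for this illustration.

```java
import java.util.Arrays;

class OffsetCount {
    // For every a[i], returns the number of distinct values in b that are < a[i].
    static int[] countDistinctSmaller(int[] a, int[] b) {
        int[] e = new int[a.length];
        if (b.length == 0) {
            return e; // nothing can be smaller
        }
        int min = Arrays.stream(b).min().getAsInt();
        int max = Arrays.stream(b).max().getAsInt();

        // Assumption: max - min fits comfortably into an array size.
        boolean[] present = new boolean[max - min + 1];
        for (int v : b) {
            present[v - min] = true; // duplicates just set the same flag again
        }

        // d[k] = number of distinct values of b that are < min + k
        int[] d = new int[present.length + 1];
        for (int k = 0; k < present.length; k++) {
            d[k + 1] = d[k] + (present[k] ? 1 : 0);
        }

        for (int i = 0; i < a.length; i++) {
            if (a[i] <= min) {
                e[i] = 0;                 // no value of b is smaller
            } else if (a[i] > max) {
                e[i] = d[present.length]; // every distinct value of b is smaller
            } else {
                e[i] = d[a[i] - min];
            }
        }
        return e;
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 4, 8, 17, 12, 22};
        int[] b = {2, 4, 3, 4, 11, 13, 17};
        // prints [3, 0, 2, 3, 5, 4, 6]
        System.out.println(Arrays.toString(countDistinctSmaller(a, b)));
    }
}
```

Here the prefix-sum array d is one element longer than the presence array, which avoids the special case for index -1.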
A simple Java implementation:
int a[] = {5, 1, 4, 8, 17, 12, 22};
int b[] = {2, 4, 3, 4, 11, 13, 17};
int max = Arrays.stream(b).max().getAsInt();
int c[] = new int[max + 1]; // c[v] == 1 if v occurs in b
int d[] = new int[max + 1]; // d[v] == number of distinct values in b that are <= v
int e[] = new int[a.length];
for (int i = 0; i < b.length; i++) {
    c[b[i]] = 1;
}
d[0] = c[0]; // counts the value 0 if b contains it
for (int i = 1; i < c.length; i++) {
    d[i] = d[i - 1] + c[i];
}
for (int i = 0; i < a.length; i++) {
    if (a[i] <= 0) {
        e[i] = 0; // b is non-negative, so nothing in b is smaller
    } else {
        e[i] = (a[i] > max) ? d[d.length - 1] : d[a[i] - 1];
    }
}
System.out.println(Arrays.toString(a));
System.out.println(Arrays.toString(b));
System.out.println(Arrays.toString(c));
System.out.println(Arrays.toString(d));
System.out.println(Arrays.toString(e));
For a larger sorted set we can use the divide-and-conquer principle (binary search) to speed up the search. Here is my solution, which has O(log n) time complexity per query once b is sorted, and O(n) space complexity.
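public static void main(String arg[]) {
    int x = 5;
    // binary search needs sorted input; distinct() drops the duplicates that must not be counted
    int b[] = Arrays.stream(new int[] {2, 4, 3, 4, 11, 13, 17}).distinct().sorted().toArray();
    int high = b.length - 1;
    int low = 0;
    while (high >= low) {
        int mid = low + (high - low) / 2; // avoids int overflow for very large arrays
        if (b[mid] < x)
            low = mid + 1;
        else
            high = mid - 1;
    }
    // low is now the number of distinct elements of b that are smaller than x
    System.out.println(low);
}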
This should be a possible solution. The "expensive" task is the sorting of the lists. Both lists must be sorted before the for loop. Make sure you use a fast mechanism to execute the sorting. As explained, sorting an array or array list is a very expensive operation, especially if there are many values to sort.
public static void main(String[] args) {
    int a[] = { 1, 2, 3, 4, 5 };
    int b[] = { 2, 4, 3, 4, 11, 13, 17 };
    List<Integer> listA = new LinkedList<>();
    for (int i : a) {
        listA.add(i);
    }
    List<Integer> listB = new LinkedList<>();
    for (int i : b) {
        listB.add(i);
    }
    Collections.sort(listA);
    Collections.sort(listB);
    int smallerValues = 0;
    Integer lastCounted = null; // last value of listB that was counted
    Iterator<Integer> iterator = listB.iterator();
    // null means listB is used up
    Integer nextValue = iterator.hasNext() ? iterator.next() : null;
    for (Integer x : listA) {
        while (nextValue != null && nextValue < x) {
            if (!nextValue.equals(lastCounted)) { // skip duplicates
                smallerValues++;
                lastCounted = nextValue;
            }
            nextValue = iterator.hasNext() ? iterator.next() : null;
        }
        System.out.println(x + " - " + smallerValues);
    }
}