Efficient algorithm to find the number of elements less than a query

I have two unsorted arrays a and b. For every element a[i] I need to find the number of elements b[j] such that b[j] < a[i]. In addition, b may contain duplicates, which should not be counted. Both arrays may be very large.

I tried (for a single query x):

public static void main(String arg[]) {
    int x = 5;
    int b[] = {2, 4, 3, 4, 11, 13, 17};
    Arrays.sort(b);
    int count = 0;
    for(int i = 0; i < b.length; ++i) {
        if(b[i] < x) {
            if(i == 0)
                ++count;
            else {
                // this check avoids counting duplicates
                if(b[i - 1] != b[i])
                    ++count;
            }
        } else {
            break;
        }
    }
    System.out.println(count);
}

My problem is that this doesn't perform well enough when querying all elements of a iteratively. What can I do to speed this up?

EDIT: given the later comments, here are some updates, which I put right at the beginning; my first text is left at the bottom.

So, the core aspects here are:

  1. You came here with some problem X, but further questions revealed that you actually had some problem Y to solve. That is something you should try to avoid: when coming here (or when working on problems on your own!), you should be able to clearly describe the problem you have or intend to solve. I am not finger-pointing here; just expressing that you should work hard to ensure that you understand what your real problem is.
  2. This is also visible from the fact that you are asking us what to do about duplicate numbers in your data. Err, sir: it is your problem. We do not know why you want to count those numbers, where your data is coming from, or how the final solution should deal with duplicate entries. In that sense, I am just rephrasing the first paragraph: you have to clarify your requirements. We can't help with that part at all. And you see: you only mentioned duplicates in the second array. What about those in the first one?!

OK, so about your problem. Thing is: actually, that is just "work". There is no magic there. As you have two very large arrays, working on unsorted data is an absolute no-go.

So, you start by sorting both arrays.

Then you iterate over the first array and, while doing that, you also look into the second array:

int indexWithinB = 0;
int counterForCurrentA = 0; // carries over: it also covers ALL values from a before
for (int i = 0; i < a.length; i++) {
  int currentA = a[i];
  // the bounds check stops the scan once b is exhausted
  while (indexWithinB < b.length && b[indexWithinB] < currentA) {
    // count a value only at its first occurrence to avoid counting duplicates;
    // the == 0 test avoids reading b[0-1]
    if (indexWithinB == 0 || b[indexWithinB - 1] != b[indexWithinB]) {
      counterForCurrentA++;
    }
    indexWithinB++;
  }
  // while loop ended, this means: b is exhausted, or b[indexWithinB] >= currentA
  // this also means: counterForCurrentA holds the count for currentA
}

The above is a sketch rather than tested code. It is meant to get you going, and it may very well be that there are subtle errors in there. For example, as Andreas pointed out, the loop needs to check against b.length, too, which is why the while condition tests indexWithinB < b.length first.

That is what I meant with "just work": you simply have to sit down, write test cases and refine my draft algorithm until it does the job for you. If you find it too hard to program this initially, then take a piece of paper, put down two arrays with numbers, and do that counting manually.
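One detail the draft leaves open is that sorting a loses the original order of the queries. The following is a minimal sketch of one way to handle that, assuming the results are wanted in the original order of a; the class name MergeCount and the method countSmallerDistinct are made up for illustration and do not come from the answer.

```java
import java.util.Arrays;
import java.util.Comparator;

public class MergeCount {
    // For each a[i], returns how many distinct values in b are strictly smaller.
    static int[] countSmallerDistinct(int[] a, int[] b) {
        int[] sortedB = b.clone();
        Arrays.sort(sortedB);

        // Sort the indices of a by value, so results can be written back
        // into the positions the queries had originally.
        Integer[] order = new Integer[a.length];
        for (int i = 0; i < a.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingInt(i -> a[i]));

        int[] result = new int[a.length];
        int j = 0;      // next unread position in sortedB
        int count = 0;  // distinct values of b passed so far
        for (int idx : order) {
            while (j < sortedB.length && sortedB[j] < a[idx]) {
                // count a value only at its first occurrence
                if (j == 0 || sortedB[j - 1] != sortedB[j]) {
                    count++;
                }
                j++;
            }
            result[idx] = count;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 4, 8, 17, 12, 22};
        int[] b = {2, 4, 3, 4, 11, 13, 17};
        System.out.println(Arrays.toString(countSmallerDistinct(a, b)));
        // prints [3, 0, 2, 3, 5, 4, 6]
    }
}
```

The two sorts dominate the cost; the merge itself reads every element of a and b only once.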

Final hint: I suggest writing plenty of unit tests to test your algorithm (such stuff is perfect for unit tests), and making sure that all your corner cases are covered by those tests. You want to be 100% sure that your algorithm is correct before going for your 10^5 element arrays!
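A simple way to get such tests is to compare the fast algorithm against a slow but obviously correct reference. The sketch below is only an illustration: the class name BruteForceOracle and the chosen corner cases are assumptions, not part of the answer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BruteForceOracle {
    // O(a.length * b.length) reference implementation, meant for tests only.
    static int[] countSmallerDistinct(int[] a, int[] b) {
        int[] result = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            Set<Integer> seen = new HashSet<>(); // a set ignores duplicates
            for (int value : b) {
                if (value < a[i]) {
                    seen.add(value);
                }
            }
            result[i] = seen.size();
        }
        return result;
    }

    static void check(int[] a, int[] b, int[] expected) {
        int[] actual = countSmallerDistinct(a, b);
        if (!Arrays.equals(actual, expected)) {
            throw new AssertionError(Arrays.toString(actual));
        }
    }

    public static void main(String[] args) {
        check(new int[]{5}, new int[]{}, new int[]{0});          // empty b
        check(new int[]{5}, new int[]{4, 4, 4}, new int[]{1});   // only duplicates
        check(new int[]{1}, new int[]{2, 3}, new int[]{0});      // query below all of b
        check(new int[]{100}, new int[]{2, 4, 3, 4, 11, 13, 17}, new int[]{6}); // query above all of b
        System.out.println("all corner cases passed");
    }
}
```

In a real test suite the same inputs would be fed to the optimized version and the two result arrays compared.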

And here, as promised, the original answer:

Simply put: iterating and counting is the most efficient way to solve this problem. So in your case above, leaving out the sorting might lead to a quicker overall execution time.

The logic there is really simple: in order to know the count of numbers smaller than x, you have to look at all of them. Thus you have to iterate over the complete array (when that array is not sorted).

Thus, given your initial statement, there is nothing else to do than: iterate and count.

Of course, if you have to do this counting multiple times, it might be worth sorting that data initially. Because then you can use binary search, and getting the count you are looking for works without iterating over all the data.
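As a rough sketch of that idea, assuming b is sorted and deduplicated once up front: the position that Arrays.binarySearch reports for a query is then exactly the number of distinct smaller values. The class name SortedQueries and the helper countLess are hypothetical names chosen for this example.

```java
import java.util.Arrays;

public class SortedQueries {
    // sortedDistinct must be sorted ascending and free of duplicates.
    static int countLess(int[] sortedDistinct, int x) {
        int pos = Arrays.binarySearch(sortedDistinct, x);
        // found: pos is the index of x, which equals the number of smaller values
        // not found: pos is -(insertionPoint) - 1, so recover the insertion point
        return pos >= 0 ? pos : -pos - 1;
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 4, 8, 17, 12, 22};
        int[] b = {2, 4, 3, 4, 11, 13, 17};

        // One-time preprocessing: O(n log n)
        int[] sortedDistinct = Arrays.stream(b).sorted().distinct().toArray();

        // Each query afterwards: O(log n)
        int[] result = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = countLess(sortedDistinct, a[i]);
        }
        System.out.println(Arrays.toString(result));
        // prints [3, 0, 2, 3, 5, 4, 6]
    }
}
```

Removing the duplicates before searching matters: on an array with repeated values, binarySearch makes no promise about which of the equal elements it finds.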

And: what makes you think that iterating over an array with 10^5 elements is a problem? In other words: are you just worried about a potential performance problem, or do you have a real performance problem? You see, at some point you probably had to create and fill that array. That for sure took more time (and resources) than a simple for-loop to count entries. And honestly: unless we are talking about some small embedded device, 10^5 elements is close to nothing, even when using slightly outdated hardware.

Finally: when you are worried about runtime, the simple answer is: slice your input data, and use 2, 4, 8, ... threads to count each slice in parallel! But as said: before writing that code, I would do some profiling to be sure that you really have to spend precious development time on this. Don't solve hypothetical performance problems; focus on those that really matter to you or your users!
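For a single query, slicing by hand is not strictly necessary in Java. The sketch below swaps in a parallel stream from the standard library, which splits the array across worker threads on its own; this is a substitute for the manual slicing described above, not the same technique, and the class name ParallelCount is made up for the example.

```java
import java.util.Arrays;

public class ParallelCount {
    // Counts the distinct values in b that are strictly smaller than x.
    static long countSmallerDistinct(int[] b, int x) {
        return Arrays.stream(b)
                .parallel()          // let the stream split the work across threads
                .filter(v -> v < x)  // keep only values below the query
                .distinct()          // drop duplicates
                .count();
    }

    public static void main(String[] args) {
        int[] b = {2, 4, 3, 4, 11, 13, 17};
        System.out.println(countSmallerDistinct(b, 5)); // prints 3
    }
}
```

Whether this beats a plain loop depends on the array size and the machine, so it is worth measuring before adopting it.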

Comparing every item in the array with x takes O(n) time. Sorting the array takes O(n log n), after which a binary search takes O(log n), for a total of O(n log n). So for a single query the most efficient way is also the trivial one: just loop through the array and compare every item with x.

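public static void main(String arg[]) {
    int b[] = {2, 4, 3, 4, 11, 13, 17};
    int x = 5;
    // a set keeps each value once, so duplicates in b are not counted
    java.util.Set<Integer> smaller = new java.util.HashSet<>();
    for (int i = 0; i < b.length; i++) {
        if (b[i] < x) {
            smaller.add(b[i]);
        }
    }
    System.out.println(smaller.size()); // prints 3
}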

I suggest you consider the following approach, but it works only if the b array contains non-negative numbers. The algorithm works even if the input arrays (both a and b) are not sorted.

Pseudo-code

  1. Get the max element of array b.
  2. Create a new array c of size max + 1 and put 1 in the position c[b[i]] for every element b[i].
  3. Create a new array d of size max + 1 and populate it as follows:

    d[0]=c[0];
    d[i]=d[i-1] + c[i];

  4. Create a new array e of size n and populate it as follows:

    if(a[i] > max) then e[i] = last(d)
    else if(a[i] <= 0) then e[i] = 0
    otherwise e[i]=d[a[i]-1];

The e array represents the solution: its i-th position contains the number of distinct values of the b array that are lower than the i-th element of array a. This example should be more illustrative than the pseudo-code:

a = [5, 1, 4, 8, 17, 12, 22]
b = [2, 4, 3, 4, 11, 13, 17]
c = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1]
d = [0, 0, 1, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6]
e = [3, 0, 2, 3, 5, 4, 6]

Complexity

Steps 1, 2 and 4 are O(n).
Step 3 is O(max(b))

If the input array b contains only "short" numbers (max(b) is of the same order as n), the algorithm runs in O(n). The algorithm could be optimized by creating an array of size max-min+1 and using a count of 0 for all the elements of a that are lower than or equal to min(b).
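The max-min+1 optimization could look roughly like the sketch below. It assumes that max(b) - min(b) fits comfortably in an int and in memory; the class name OffsetCounting and the array names present and prefix are invented for this example. As a side effect of the offset, negative values in b no longer cause trouble.

```java
import java.util.Arrays;

public class OffsetCounting {
    // For each a[i], returns how many distinct values in b are strictly smaller.
    static int[] countSmallerDistinct(int[] a, int[] b) {
        if (b.length == 0) {
            return new int[a.length]; // nothing in b, so every count is 0
        }
        int min = Arrays.stream(b).min().getAsInt();
        int max = Arrays.stream(b).max().getAsInt();

        // present[k] is true when the value min + k occurs in b
        boolean[] present = new boolean[max - min + 1];
        for (int v : b) {
            present[v - min] = true;
        }

        // prefix[k] = number of distinct values in b that are < min + k
        int[] prefix = new int[present.length + 1];
        for (int k = 0; k < present.length; k++) {
            prefix[k + 1] = prefix[k] + (present[k] ? 1 : 0);
        }

        int[] e = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            if (a[i] <= min) {
                e[i] = 0;                          // nothing in b is smaller
            } else if (a[i] > max) {
                e[i] = prefix[prefix.length - 1];  // everything in b is smaller
            } else {
                e[i] = prefix[a[i] - min];
            }
        }
        return e;
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 4, 8, 17, 12, 22};
        int[] b = {2, 4, 3, 4, 11, 13, 17};
        System.out.println(Arrays.toString(countSmallerDistinct(a, b)));
        // prints [3, 0, 2, 3, 5, 4, 6]
    }
}
```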

A simple Java implementation:

int a[] = {5, 1, 4, 8, 17, 12, 22};
int b[] = {2, 4, 3, 4, 11, 13, 17};
int max = Arrays.stream(b).max().getAsInt();
int c[] = new int[max+1];
int d[] = new int[max+1];
int e[] = new int[a.length];
for(int i=0;i<b.length;i++){
    c[b[i]]=1;
}
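d[0] = c[0]; // a 0 in b must be counted as well
for(int i=1;i<c.length;i++){
    d[i] = d[i-1] + c[i];
}
for (int i = 0; i<a.length;i++){
    // d[k] is the number of distinct values in b that are <= k;
    // queries <= 0 have no smaller value in b and would otherwise index d[-1]
    e[i]=(a[i]>max)?d[d.length-1]:(a[i]<=0?0:d[a[i]-1]);
}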
System.out.println(Arrays.toString(a));
System.out.println(Arrays.toString(b));
System.out.println(Arrays.toString(c));
System.out.println(Arrays.toString(d));
System.out.println(Arrays.toString(e));

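For a larger sorted set we can use the divide-and-conquer principle to speed up the search. Here is my solution, which takes O(log n) time per query once b is sorted and deduplicated (the sort itself costs O(n log n)) and O(n) space.

import java.util.Arrays;

public static void main(String arg[]) {
    int x = 5;
    // binary search needs sorted input; distinct() drops the duplicates
    int b[] = Arrays.stream(new int[] {2, 4, 3, 4, 11, 13, 17}).sorted().distinct().toArray();
    int high = b.length - 1;
    int low = 0;

    while (high >= low) {
        int mid = low + (high - low) / 2; // avoids int overflow of high + low
        if (b[mid] < x)
            low = mid + 1;
        else
            high = mid - 1;
    }
    // low is now the number of distinct elements smaller than x
    System.out.println(low);
}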

This should be a possible solution. The "expensive" task is the sorting of the lists. Both lists must be sorted before the for loop. Make sure you use a fast mechanism to execute the sorting. As explained, sorting an array or array list is a very expensive operation, especially if there are many values to sort.

import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public static void main(String[] args) {
    int a[] = { 1, 2, 3, 4, 5 };
    int b[] = { 2, 4, 3, 4, 11, 13, 17 };
    List<Integer> listA = new LinkedList<>();
    for (int i : a) {
        listA.add(i);
    }
    List<Integer> listB = new LinkedList<>();
    for (int i : b) {
        listB.add(i);
    }
    Collections.sort(listA);
    Collections.sort(listB);
    int smallerValues = 0;
    Integer lastCounted = null; // last value of b that was counted
    Iterator<Integer> iterator = listB.iterator();
    // nextValue is null once listB is exhausted (or if it was empty)
    Integer nextValue = iterator.hasNext() ? iterator.next() : null;
    for (Integer x : listA) {
        while (nextValue != null && nextValue < x) {
            // count each distinct value once, including the last one in listB
            if (!nextValue.equals(lastCounted)) {
                smallerValues++;
                lastCounted = nextValue;
            }
            nextValue = iterator.hasNext() ? iterator.next() : null;
        }
        System.out.println(x + " - " + smallerValues);
    }
}
