How to efficiently sort one million elements?

I need to compare about 60,000 elements against a list of 935,000 elements, and whenever they match I need to perform a calculation.

I have already implemented everything needed, but the process takes about 40 minutes. Both lists contain a unique 7-digit number. The 935,000 and the 60,000 files are unsorted. Is it more efficient to sort (which sort?) the big list before I try to find the elements? Keep in mind that I have to do this calculation only once a month, so I don't need to repeat the process every day.

Basically, which is faster:

  • unsorted linear search
  • sort the list first and then search with another algorithm

Try it out.

You've got Collections.sort(), which will do the heavy lifting for you, and Collections.binarySearch(), which will allow you to find the elements in the sorted list.
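A minimal sketch of that combination, assuming the 7-digit numbers are held as Integer values. The class name SortThenSearch, the method countMatches, and the sample data are made up for illustration and are not from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortThenSearch {
    // Counts how many keys from shortList also occur in longList.
    // A copy of longList is sorted once, then binary-searched per key.
    static int countMatches(List<Integer> longList, List<Integer> shortList) {
        List<Integer> sorted = new ArrayList<>(longList);
        Collections.sort(sorted);
        int matches = 0;
        for (Integer key : shortList) {
            // binarySearch returns an index >= 0 when the key is found
            // and a negative value otherwise.
            if (Collections.binarySearch(sorted, key) >= 0) {
                matches++; // the real calculation would go here
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample data
        List<Integer> longList = Arrays.asList(5550001, 1234567, 9000000, 7654321);
        List<Integer> shortList = Arrays.asList(7654321, 1111111, 1234567);
        System.out.println(countMatches(longList, shortList));
    }
}
```

Sorting a copy leaves the original list untouched; if that does not matter, sorting in place saves the extra memory.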

When you search the unsorted list, you have to look through half the elements on average before you find the one you're looking for. When you do that 60,000 times on a list of 935,000 elements, that works out to about

935,000 * 1/2 * 60,000 = 28,050,000,000 operations

If you sort the list first (using mergesort), it will take about n * log(n) operations. Then you can use binary search to find each of the 60,000 elements of your short list in about log(n) lookups. With logarithms taken to base 2, that's about

935,000 * log2(935,000) + log2(935,000) * 60,000 = 19,735,434 operations
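To double-check the two estimates, here is a small sketch that recomputes them. It assumes the logarithm is base 2, which reproduces the figure above to within rounding; the class and method names are hypothetical.

```java
class OperationEstimate {
    // Base-2 logarithm, assumed to be the one used in the estimate above.
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Linear search: on average half of the n elements are inspected per lookup.
    static double linearSearchOps(long n, long m) {
        return n / 2.0 * m;
    }

    // Sort once (n * log2 n), then one binary search (log2 n) per lookup.
    static double sortThenSearchOps(long n, long m) {
        return n * log2(n) + m * log2(n);
    }

    public static void main(String[] args) {
        long n = 935_000; // size of the long list
        long m = 60_000;  // number of lookups
        System.out.printf("linear search:    %,.0f%n", linearSearchOps(n, m));
        System.out.printf("sort then search: %,.0f%n", sortThenSearchOps(n, m));
    }
}
```

These are rough operation counts rather than timings, but the gap is a factor of more than a thousand.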

It should be a lot faster if you sort the list first, then use a search algorithm that takes advantage of the sorted list.

What would work quite well is to sort both lists and then iterate over both at the same time.

Use Collections.sort() to sort the lists.

You start with an index for each sorted list and basically walk straight through both. Take the first element of the short list and compare it to the elements of the long list, starting from the first. When you reach an element in the long list with a higher 7-digit number than the current number in the short list, increment your index into the short list. This way there is no need to check any element twice.

But actually, since you want to find the intersection of two lists, you might be better off just using longList.retainAll(shortList) to get the intersection directly. After that, every element left in the long list is a match, so you can process each one in about O(1) without having to search for anything.
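A minimal sketch of the retainAll() idea, assuming the elements are Integer keys (or any type whose equals() and hashCode() compare the 7-digit number). One caveat: as far as I know, ArrayList.retainAll() calls contains() on its argument for every element, so passing a plain List makes each check a linear scan. Wrapping the short list in a HashSet, as below, is my own adjustment to avoid that. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

class IntersectionSketch {
    // Returns the elements of longList that also occur in shortList,
    // in the order they appear in longList.
    static List<Integer> intersection(List<Integer> longList, List<Integer> shortList) {
        // Work on a copy because retainAll modifies the list it is called on.
        List<Integer> result = new ArrayList<>(longList);
        // A HashSet argument makes each contains() check roughly constant time.
        result.retainAll(new HashSet<>(shortList));
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data
        List<Integer> longList = Arrays.asList(5550001, 1234567, 9000000, 7654321);
        List<Integer> shortList = Arrays.asList(7654321, 1111111, 1234567);
        System.out.println(intersection(longList, shortList));
    }
}
```

With this variant no sorting is needed at all, at the cost of the extra memory for the set.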

You can sort both lists and compare them element by element, incrementing the first or second index (i and j in the example below) as needed:

List<Comparable> first = ...;
List<Comparable> second = ...;
Collections.sort(first);
Collections.sort(second);

int i = 0;
int j = 0;
while (i < first.size() && j < second.size()) {
    // Compare once per step and reuse the result
    int cmp = first.get(i).compareTo(second.get(j));
    if (cmp == 0) {
        // Action for equals
    }
    if (cmp > 0) {
        j++;  // second is behind: advance it
    } else {
        i++;  // first is behind, or the elements matched: advance it
    }
}

The complexity of this code is O(n log(n)), where n is the size of the larger list; the sorting dominates, since the walk itself is linear.
