
Sorting a List in parallel without creating a temporary array in Java 8

Java 8 provides java.util.Arrays.parallelSort, which sorts arrays in parallel using the fork-join framework. But there's no corresponding Collections.parallelSort for sorting lists.

I can use toArray, sort that array, and store the result back in my list, but that temporarily increases memory usage, which is already high because parallel sorting only pays off for huge lists. Instead of twice the memory (the list plus parallelSort's working memory), I'm using three times as much (the list, the temporary array and parallelSort's working memory). (The Arrays.parallelSort documentation says "The algorithm requires a working space no greater than the size of the original array".)

Memory usage aside, Collections.parallelSort would also be more convenient for what seems like a reasonably common operation. (I tend not to use arrays directly, so I'd certainly use it more often than Arrays.parallelSort.)

The library can test for RandomAccess to avoid trying to, e.g., quicksort a linked list, so that can't be a reason for a deliberate omission.

How can I sort a List in parallel without creating a temporary array?

There doesn't appear to be any straightforward way to sort a List in parallel in Java 8. I don't think this is fundamentally difficult; it looks more like an oversight to me.

The difficulty with a hypothetical Collections.parallelSort(list, cmp) is that the Collections implementation knows nothing about the list's implementation or its internal organization. This can be seen by examining the Java 7 implementation of Collections.sort(list, cmp). As you observed, it has to copy the list elements out to an array, sort them, and then copy them back into the list.

This is the big advantage of the List.sort(cmp) extension method over Collections.sort(list, cmp). It might seem that being able to write myList.sort(cmp) instead of Collections.sort(myList, cmp) is merely a small syntactic advantage. The difference is that myList.sort(cmp), being an interface extension (default) method, can be overridden by the specific List implementation. For example, ArrayList.sort(cmp) sorts the list in place using Arrays.sort(), whereas the default implementation uses the old copyout-sort-copyback technique.
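As a small illustration of that dispatch, here is a sketch with a hypothetical CountingList subclass (the class name and its sortCalls field are invented for this example). It only counts calls before delegating to ArrayList, but it shows that an override of sort() is what actually runs when the caller holds a plain List reference.

```java
import java.util.ArrayList;
import java.util.Comparator;

// Hypothetical example class: an ArrayList that records how often sort() is invoked.
final class CountingList<E> extends ArrayList<E> {
    int sortCalls = 0;

    @Override
    public void sort(Comparator<? super E> c) {
        sortCalls++;     // proves the override was reached
        super.sort(c);   // ArrayList sorts its backing array in place
    }
}
```

Calling plain.sort(null) on a variable declared as List<Integer> plain = new CountingList<>() increments the counter. On Java 8 and later, Collections.sort(plain) does too, because Collections.sort delegates to List.sort.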

It should be possible to add a parallelSort extension method to the List interface that has similar semantics to List.sort but does the sorting in parallel. This would allow ArrayList to do a straightforward in-place sort using Arrays.parallelSort. (It's not entirely clear to me what the default implementation should do. It might still be worth it to do copyout-parallelSort-copyback.) Since this would be an API change, it can't happen until the next major release of Java SE.

As for a Java 8 solution, there are a couple of workarounds, none very pretty (as is typical of workarounds). You could create your own array-based List implementation and override sort() to sort in parallel. Or you could subclass ArrayList, override sort(), grab the elementData array via reflection and call parallelSort() on it. Of course you could just write your own List implementation and provide a parallelSort() method, but the advantage of overriding List.sort() is that it works through the plain List interface, so you don't have to modify all the code in your code base to use a different List subclass.
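The first workaround might look roughly like this. It is a minimal sketch, not a production collection: ParallelSortArrayList is a hypothetical name, the growth policy and bounds checks are simplified, and only the methods AbstractList needs for a modifiable list are implemented.

```java
import java.util.AbstractList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.RandomAccess;

// Hypothetical array-backed list whose sort() runs Arrays.parallelSort on its own storage.
final class ParallelSortArrayList<E> extends AbstractList<E> implements RandomAccess {
    private Object[] elements = new Object[16];
    private int size;

    @Override
    public int size() {
        return size;
    }

    @SuppressWarnings("unchecked")
    @Override
    public E get(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException("index " + index);
        return (E) elements[index];
    }

    @Override
    public E set(int index, E value) {
        E old = get(index);           // also validates the index
        elements[index] = value;
        return old;
    }

    @Override
    public void add(int index, E value) {
        if (index < 0 || index > size) throw new IndexOutOfBoundsException("index " + index);
        if (size == elements.length) elements = Arrays.copyOf(elements, size * 2);
        System.arraycopy(elements, index, elements, index + 1, size - index);
        elements[index] = value;
        size++;
        modCount++;
    }

    @Override
    public E remove(int index) {
        E old = get(index);
        System.arraycopy(elements, index + 1, elements, index, size - index - 1);
        elements[--size] = null;      // drop the stale reference
        modCount++;
        return old;
    }

    @SuppressWarnings("unchecked")
    @Override
    public void sort(Comparator<? super E> c) {
        // Sorts the occupied prefix of the backing array directly, with no copy-out of the list.
        // A null comparator means natural ordering, as documented for Arrays.parallelSort.
        Arrays.parallelSort((E[]) elements, 0, size, c);
        modCount++;
    }
}
```

Because the override is on List.sort itself, code that only sees a List (including Collections.sort on Java 8 and later) picks up the parallel sort. Arrays.parallelSort still needs its own working space, so this removes the temporary array but not the sort's internal buffer.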

I think you are doomed to use a custom List implementation augmented with your own parallelSort, or else change all your other code to store the big data in arrays.

This is the inherent problem with layers of abstract data types. They're meant to isolate the programmer from details of implementation. But when the details of implementation matter, as in the case of the underlying storage model for sort, the otherwise splendid isolation leaves the programmer helpless.

The standard List sort documentation provides an example. After the explanation that mergesort is used, it says:

"The default implementation obtains an array containing all elements in this list, sorts the array, and iterates over this list resetting each element from the corresponding position in the array. (This avoids the n^2 log(n) performance that would result from attempting to sort a linked list in place.)"

In other words, "since we don't know the underlying storage model for a List and couldn't touch it if we did, we make a copy organized in a known way." The parenthesized expression is based on the fact that the List "i'th element accessor" on a linked list is Omega(n), so the normal array mergesort implemented with it would be a disaster. In fact it's easy to implement mergesort efficiently on linked lists when you can manipulate the nodes directly. The List implementer is just prevented from doing it through the generic interface.
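To make the linked-list claim concrete, here is a minimal sketch of merge sort that relinks nodes instead of indexing. It assumes a hand-rolled singly linked IntNode type (hypothetical, for illustration), not java.util.LinkedList, whose nodes are not accessible from outside.

```java
// Hypothetical minimal singly linked node holding an int.
final class IntNode {
    final int value;
    IntNode next;

    IntNode(int value, IntNode next) {
        this.value = value;
        this.next = next;
    }
}

final class LinkedMergeSort {
    private LinkedMergeSort() {}

    // Sorts the chain starting at head and returns the new head. No element is accessed by index.
    static IntNode sort(IntNode head) {
        if (head == null || head.next == null) return head;

        // Find the middle with slow/fast pointers, then cut the chain in two.
        IntNode slow = head;
        IntNode fast = head.next;
        while (fast != null && fast.next != null) {
            slow = slow.next;
            fast = fast.next.next;
        }
        IntNode second = slow.next;
        slow.next = null;

        return merge(sort(head), sort(second));
    }

    // Merges two sorted chains by relinking nodes; ties keep the left chain first (stable).
    private static IntNode merge(IntNode a, IntNode b) {
        IntNode dummy = new IntNode(0, null);
        IntNode tail = dummy;
        while (a != null && b != null) {
            if (a.value <= b.value) {
                tail.next = a;
                a = a.next;
            } else {
                tail.next = b;
                b = b.next;
            }
            tail = tail.next;
        }
        tail.next = (a != null) ? a : b;
        return dummy.next;
    }
}
```

Every step walks or relinks nodes sequentially, so the cost stays at O(n log n) comparisons with no temporary array. This only works because the code can reach the next pointers, which is exactly the access a generic List caller does not have.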

A parallel sort on List has the same problem. The standard sequential sort fixes it with custom sort methods in the concrete List implementations. The Java folks just haven't chosen to go there yet. Maybe in Java 9.

Use the following:

yourCollection.parallelStream().sorted().collect(Collectors.toList());

This will be parallel when sorting, because of parallelStream(). I believe this is what you mean by parallel sort?
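For reference, a self-contained version of that one-liner might look like the sketch below (StreamSortDemo and sortedCopy are hypothetical names). Note what it does and does not do: it returns a new, sorted list and leaves the source list untouched, so it does not avoid the extra memory the question is concerned about.

```java
import java.util.List;
import java.util.stream.Collectors;

final class StreamSortDemo {
    private StreamSortDemo() {}

    // Returns a sorted copy of source; the source list itself is not modified.
    static <T extends Comparable<? super T>> List<T> sortedCopy(List<T> source) {
        return source.parallelStream()
                     .sorted()
                     .collect(Collectors.toList());
    }
}
```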

Just speculating here, but I see several good reasons for generic sort algorithms preferring to work on arrays instead of List instances:

  • Element access is performed via method calls. Despite all the optimizations the JIT can apply, even for a list that implements RandomAccess, this probably means a lot of overhead compared to plain array accesses, which can be optimized very well.
  • Many algorithms require copying some fragments of the array to temporary structures. There are efficient methods for copying arrays or their fragments. An arbitrary List instance, on the other hand, can't be easily copied. New lists would have to be allocated, which poses two problems. First, this means allocating some new objects, which is likely more costly than allocating arrays. Second, the algorithm would have to choose which implementation of List should be allocated for this temporary structure. There are two obvious solutions, both bad: either just choose some hard-coded implementation, e.g. ArrayList, but then it could just as well allocate simple arrays (and if we're generating arrays then it's much easier if the source is also an array). Or, let the user provide some list factory object, which makes the code much more complicated.
  • Related to the previous issue: there is no obvious way of copying a list into another due to how the API is designed. The best the List interface offers is the addAll() method, but this is probably not efficient in most cases (think of pre-allocating the new list to its target size vs. adding elements one by one, which many implementations do).
  • Most lists that need to be sorted will be small enough for another copy to not be an issue.

So the designers probably cared most about CPU efficiency and code simplicity, and this is easily achieved when the API accepts arrays. Some languages, e.g. Scala, have sort methods that work directly on lists, but this comes at a cost and is probably less efficient than sorting arrays in many cases (or sometimes there will probably just be a conversion to and from an array performed behind the scenes).

By combining the existing answers I came up with this code.
This works if you are not interested in creating a custom List class and if you don't mind creating a temporary array (Collections.sort does it anyway for lists that don't override sort).
This uses the initial list and does not create a new one as in the parallelStream solution.

// Needs java.util.Arrays, java.util.Comparator and java.util.ListIterator.
// Foo, fooLst and Foo::yourMethod are placeholders for your own type, list and sort key.

// Convert the List to an array so we can use Arrays.parallelSort rather than Collections.sort.
// The default List.sort implementation begins with this very same conversion, so we're not
// adding overhead in comparison with it.
Foo[] fooArr = fooLst.toArray(new Foo[0]);

// Sort on multiple threads. Automatically falls back to a single thread when the size is less than 8192.
Arrays.parallelSort(fooArr, Comparator.comparing(Foo::yourMethod));

// Refill the List from the sorted array, the same way the default List.sort does it.
ListIterator<Foo> i = fooLst.listIterator();
for (Foo e : fooArr) {
    i.next();
    i.set(e);
}
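If you want this as a reusable utility rather than an inline fragment, a generic wrapper along the following lines should work. It is a sketch under the same assumptions as the snippet above: ListSorting and its parallelSort method are hypothetical names, and it still allocates one temporary array the size of the list.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.ListIterator;

final class ListSorting {
    private ListSorting() {}

    // Copy-out, parallel sort, copy-back. Works for any List whose iterator supports set().
    @SuppressWarnings("unchecked")
    static <T> void parallelSort(List<T> list, Comparator<? super T> cmp) {
        // The Object[] never escapes this method, so the unchecked cast is safe here.
        T[] array = (T[]) list.toArray();
        Arrays.parallelSort(array, cmp);

        // Write the sorted elements back in iteration order, so linked lists stay O(n).
        ListIterator<T> it = list.listIterator();
        for (T element : array) {
            it.next();
            it.set(element);
        }
    }
}
```

Writing back through a ListIterator rather than list.set(index, e) keeps the copy-back linear even for a LinkedList.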
