Most efficient collection for filtering a Java Stream?

I'm storing several Things in a Collection. The individual Things are unique, but their types aren't. The order in which they are stored doesn't matter either.

I want to use Java 8's Stream API to search it for a specific type with this code:

Collection<Thing> things = ...;
// ... populate things ...
Stream<Thing> filtered = things.stream().filter(thing -> thing.type.equals(searchType));

Is there a particular Collection that would make the filter() more efficient?

I'm inclined to think no, because the filter has to iterate through the entire collection.

On the other hand, if the collection is some sort of tree that is indexed by the Thing.type then the filter() might be able to take advantage of that fact. Is there any way to achieve this?

Stream operations such as filter are not specialized enough to take advantage of special cases like this. For example, IntStream.range(0, 1_000_000_000).filter(x -> x > 999_999_000) really does iterate over all the input numbers; it cannot just "skip" the first 999_999_000. So your question reduces to finding the collection with the most efficient iteration.
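To make the point above observable, here is a minimal sketch that counts how often the predicate is invoked. The class and method names are made up for illustration, and the range is shrunk to one million elements so it runs quickly.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

public class FilterVisitsEverything {

    // Returns { number of predicate invocations, number of matching elements }
    // for the range [0, size) filtered with x > threshold.
    static long[] run(int size, int threshold) {
        AtomicLong tested = new AtomicLong();
        long matched = IntStream.range(0, size)
                .filter(x -> {
                    tested.incrementAndGet(); // side effect only used for counting
                    return x > threshold;
                })
                .count();
        return new long[] { tested.get(), matched };
    }

    public static void main(String[] args) {
        long[] result = run(1_000_000, 999_000);
        System.out.println("predicate calls: " + result[0]);
        System.out.println("matches:         " + result[1]);
    }
}
```

The predicate is called once per input element, even though only the last few elements can match.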

Iteration is usually performed in the Spliterator.forEachRemaining method (for non-short-circuiting streams) and in the Spliterator.tryAdvance method (for short-circuiting streams), so you can look at the corresponding spliterator implementation and check how efficient it is. In my opinion the most efficient source is an array (either bare or wrapped into a list with Arrays.asList): it has minimal overhead. ArrayList is also quite fast, but for a short-circuiting operation it checks the modCount (to detect concurrent modification) on every iteration, which adds a very slight overhead. Other types such as HashSet or LinkedList are comparatively slower, though in most applications the difference is practically insignificant.

Note that parallel streams should be used with care. For example, LinkedList splits quite poorly, and you may see worse performance than in the sequential case.

The most important thing to understand regarding this question is that when you pass a lambda expression to a library such as the Stream API, all the library receives is an implementation of a functional interface, e.g. an instance of Predicate. It has no knowledge of what that implementation will do and therefore has no way to exploit scenarios like filtering sorted data via comparison. The stream library simply doesn't know that the Predicate is doing a comparison.

An implementation performing such an optimization would need cooperation between the JVM, which knows and understands the code, and the library, which knows the semantics. Nothing like that happens in the current implementation, and as far as I can see it is still a long way off.

If the source is a tree or a sorted list and you want to benefit from that for filtering, you have to do it using APIs operating on the source, before creating the stream. For example, suppose we have a TreeSet and want to filter it to get the items within a particular range:

// our made-up source
TreeSet<Integer> tree=IntStream.range(0, 100).boxed()
    .collect(Collectors.toCollection(TreeSet::new));
// the naive implementation
tree.stream().filter(i -> i>=65 && i<91).forEach(i->System.out.print((char)i.intValue()));

We can do instead:

tree.tailSet(65).headSet(91).stream().forEach(i->System.out.print((char)i.intValue()));
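As a sanity check, the following sketch compares the filter-based and view-based approaches on the same made-up source. It uses subSet(from, to), which for a TreeSet covers the same half-open range as tailSet(from).headSet(to); the class and method names are illustrative only.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TreeSetRange {

    // Same made-up source as in the answer: the integers 0..99.
    static TreeSet<Integer> source() {
        return IntStream.range(0, 100).boxed()
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Naive variant: every element of the set is tested by the predicate.
    static List<Integer> byFilter(TreeSet<Integer> tree, int from, int to) {
        return tree.stream()
                .filter(i -> i >= from && i < to)
                .collect(Collectors.toList());
    }

    // View variant: the tree locates the range itself (from inclusive, to exclusive),
    // and the stream only sees the elements inside it.
    static List<Integer> byView(TreeSet<Integer> tree, int from, int to) {
        return tree.subSet(from, to).stream()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TreeSet<Integer> tree = source();
        System.out.println(byFilter(tree, 65, 91).equals(byView(tree, 65, 91)));
    }
}
```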

which will utilize the sorted/tree nature. When we have a sorted list instead, say

List<Integer> list=new ArrayList<>(tree);

utilizing the sorted nature is more complex as the collection itself doesn't know that it's sorted and doesn't offer operations utilizing that directly:

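// find the first index whose element is >= 65
int ix=Collections.binarySearch(list, 65);
if(ix<0) ix=~ix; // not found: convert the result to the insertion point
if(ix>0) list=list.subList(ix, list.size());
// find the first index whose element is >= 91 (exclusive upper bound)
ix=Collections.binarySearch(list, 91);
if(ix<0) ix=~ix;
if(ix<list.size()) list=list.subList(0, ix);
list.stream().forEach(i->System.out.print((char)i.intValue()));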

Of course, the stream operations here are only examples; you don't need a stream at all when all you do is forEach.
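Applied to the original question, the same idea ("use the source's own API before creating the stream") could look like the sketch below: the Things are kept in a Map keyed by type, so a lookup replaces the filter. This is only an assumption about how the data might be organized; the Thing class, its fields, and the method names are hypothetical stand-ins, since the question does not show them.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ThingIndex {

    // Hypothetical stand-in for the question's Thing; only 'type' is taken from the question.
    static final class Thing {
        final String name;
        final String type;

        Thing(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    // Build the index once; this still has to visit every element.
    static Map<String, List<Thing>> indexByType(Collection<Thing> things) {
        return things.stream().collect(Collectors.groupingBy(t -> t.type));
    }

    // Each lookup is then a hash lookup instead of a full scan.
    static Stream<Thing> ofType(Map<String, List<Thing>> index, String searchType) {
        return index.getOrDefault(searchType, Collections.emptyList()).stream();
    }

    public static void main(String[] args) {
        List<Thing> things = Arrays.asList(
                new Thing("a", "red"), new Thing("b", "blue"), new Thing("c", "red"));
        Map<String, List<Thing>> index = indexByType(things);
        ofType(index, "red").forEach(t -> System.out.println(t.name));
    }
}
```

Building the index costs one full pass, so this only pays off when the same collection is searched by type more than once.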

As far as I am aware, there is no such differentiation for ordinary (sequential) streams.

However, with parallel streams you may be better off using a collection that is easily divisible, such as ArrayList, rather than LinkedList or any type of Set.
