简体   繁体   中英

Is it worth using distinct() with collect(toSet())

When collecting the elements of a stream into a set, is there any advantage (or drawback) to also specifying .distinct() on the stream? For example:

return items.stream().map(...).distinct().collect(toSet());

Given that the set will already remove duplicates, this seems redundant, but does it offer any performance advantage or disadvantage? Does the answer depend on whether the stream is parallel/sequential or ordered/unordered?

According to the javadoc , distinct is a stateful intermediate operation.

If you literally have .distinct followed immediately by .collect , it doesn't really add any benefit. Maybe if the .distinct implementation is more performant than the Set duplication check, you might get some benefit, but if you're collecting to a set you're going to end up with the same result anyway.

If, on the other hand, .distinct occurs before your .map operation, and that particular mapping is an expensive operation, you may get some gains there because you're processing less data overall.

While you have the same result, they don't do the same thing: toSet() use HashSet , and you lose the initial ordering which is what distinct can preserve if required:

From the javadoc :

Preserving stability for distinct() in parallel pipelines is relatively expensive (requires that the operation act as a full barrier, with substantial buffering overhead), and stability is often not needed. Using an unordered stream source (such as generate(Supplier)) or removing the ordering constraint with BaseStream.unordered() may result in significantly more efficient execution for distinct() in parallel pipelines, if the semantics of your situation permit. If consistency with encounter order is required, and you are experiencing poor performance or memory utilization with distinct() in parallel pipelines, switching to sequential execution with BaseStream.sequential() may improve performance.

If you require stability, then it is distinct() . Using toSet() after would be useless (if not required by an API).

That is however useful if you have an equals implementing a partial equality:

class F {
  int a;
  int b;
  @Override int hashCode() {return Objects.hashCode(a);}
  @Override boolean equals(Object other) {
    if (other == this) return true;
    if (!(other instanceof F)) return false;
    return a == ((F)other).a;
  }
}

If you have a = F(10, 1) and b = F(10, 2) they are equals. But not all their fields are equals.

If in the list you have (b, a)

  • With toSet() you won't always have this order. You might have (b, a), etc.
  • With distinct() you preserve this information, eg: (b, a) .

This however assume some prerequisites (sequential, etc).

Note: this could be done using a TreeSet and appropriate compareTo method.

distinct will call equals/hashcode to separate items and later toSet will do the same thing (even if after distinct there is no need, but toSet can't really know that), so basically you are just duplicating the calls. It should performance slightly worse IMO. It is pretty easy to measure too.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM