
Which union is more efficient: List / HashSet

I'm writing an algorithm in which I need to use collections, and the main (and only) operation on them is union.

I'm going to have about 1 million objects, and I need to know which collection has the more efficient union method: the List or the HashSet (or maybe something else?).

Thanks in advance.

I'm guessing that when you say "I will be using distinct with the List", you mean something like this:

  List l = ...
  Set result = l.stream().distinct().collect(Collectors.toSet()).union(someOtherSet);

compared with this:

  HashSet h = ...
  Set result = h.union(someOtherSet);

Clearly the second version is more efficient. The first one has to produce an intermediate set from the list each time you run it.

The only thing that the first one saves is some memory (in the long term), since the intermediate set becomes unreachable after use.

And the first version can be written more simply and more efficiently as:

  List l = ...
  Set result = new HashSet(l).union(someOtherSet);

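Note that the snippets above are pseudocode: the Java List API has no distinct() method and no union() method, and Set has no union() method either. In Java, a union is normally formed with addAll().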
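
As a rough, runnable counterpart to the pseudocode above, here is a minimal sketch that forms the union with addAll(). The class and method names (UnionWithAddAll, unionOf) are hypothetical and chosen only for illustration; the sketch assumes a new result set is wanted rather than an in-place update.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UnionWithAddAll {
    // Hypothetical helper: returns a new set holding every element of both inputs.
    // The HashSet drops duplicates itself, so no separate distinct() step is needed.
    static <T> Set<T> unionOf(Collection<? extends T> first, Collection<? extends T> second) {
        Set<T> result = new HashSet<>(first); // copy the first collection
        result.addAll(second);                // add whatever the second one contributes
        return result;
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 2, 3);
        Set<Integer> other = new HashSet<>(Arrays.asList(3, 4));
        // Works the same whether the first argument is a List or a HashSet.
        System.out.println(unionOf(list, other));
    }
}
```

The same helper accepts a List or a HashSet as either argument, so the only difference between the two cases is the cost of building the initial copy.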


If you actually use Collection.contains() to perform the union, then a HashSet will be much faster than any standard List implementation. As @JBNizet states:

HashSet.contains is O(1). List.contains is O(n).

For example:

  // Start from a copy of set1, then add whatever set2 contributes.
  Set<Integer> result = new HashSet<>(set1);
  for (Integer element : set2) {
      if (!result.contains(element)) {
          result.add(element);
      }
  }
  // result now contains the union of set1 and set2.

Almost identical code works for lists. But it is much slower.
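
To make the comparison concrete, here is a small sketch of a contains-based union written once for lists and once for hash sets. The class name, method names and the input size are all hypothetical; the timings it prints depend on the machine and JVM, so treat them as a rough indication only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ContainsUnion {
    // Union built on List.contains(): every check scans the result list,
    // so the total work grows roughly with the product of the sizes.
    static List<Integer> unionWithList(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        for (Integer x : a) {
            if (!result.contains(x)) result.add(x);
        }
        for (Integer x : b) {
            if (!result.contains(x)) result.add(x);
        }
        return result;
    }

    // Same logic on a HashSet: each contains() is an expected O(1) hash lookup.
    static Set<Integer> unionWithSet(List<Integer> a, List<Integer> b) {
        Set<Integer> result = new HashSet<>();
        for (Integer x : a) {
            if (!result.contains(x)) result.add(x);
        }
        for (Integer x : b) {
            if (!result.contains(x)) result.add(x);
        }
        return result;
    }

    public static void main(String[] args) {
        int n = 5_000; // hypothetical size, kept small so the list version finishes quickly
        List<Integer> a = new ArrayList<>();
        List<Integer> b = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            a.add(i);         // 0 .. n-1
            b.add(i + n / 2); // overlaps the upper half of a
        }
        long t0 = System.nanoTime();
        List<Integer> viaList = unionWithList(a, b);
        long t1 = System.nanoTime();
        Set<Integer> viaSet = unionWithSet(a, b);
        long t2 = System.nanoTime();
        System.out.println("sizes equal: " + (viaList.size() == viaSet.size()));
        System.out.println("list ms: " + (t1 - t0) / 1_000_000
                + ", set ms: " + (t2 - t1) / 1_000_000);
    }
}
```

In the HashSet version the explicit contains() check is redundant, since add() already ignores duplicates; it is kept here only so that the two methods stay line-for-line comparable.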

You asked:

Ok, yeah. But how about union?

See above. This is about implementing union using contains calls.

What's that? O(?)

See the following articles:

So both of the unions are the same O(N) (N being the size of the second collection)?

No.

  • Using HashSet: N x O(1) is O(N)
  • Using List: N x O(N) is O(N^2)

Or to be more precise:

  • Using HashSet: min(M, N) x O(1) is O(min(M, N))
  • Using List: N x O(M) is O(NM)

where N and M are the sizes of the two sets / lists. You can tweak the performance of the HashSet case by iterating over the smaller of the two sets, as reflected above. (These figures count only the contains / add calls, not any up-front copying of a set.)
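
The "iterate the smaller set" tweak could be sketched as follows. This is only an illustration under one loud assumption: it is acceptable to modify one of the input sets in place. The class and method names (SmallerIntoLarger, unionInPlace) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SmallerIntoLarger {
    // Assumption: mutating one of the inputs is acceptable.
    // Adds the elements of the smaller set to the larger one, so the number
    // of hash insertions is min(M, N). Returns the (mutated) larger set.
    static <T> Set<T> unionInPlace(Set<T> a, Set<T> b) {
        Set<T> larger = a.size() >= b.size() ? a : b;
        Set<T> smaller = (larger == a) ? b : a;
        larger.addAll(smaller); // iterates only over the smaller set
        return larger;
    }

    public static void main(String[] args) {
        Set<Integer> big = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        Set<Integer> small = new HashSet<>(Arrays.asList(5, 6));
        Set<Integer> union = unionInPlace(small, big);
        System.out.println(union == big);  // true: the larger input was reused
        System.out.println(union.size());  // 6
    }
}
```

If both inputs must stay untouched, a copy of the larger set is needed first, and the total cost becomes proportional to M + N rather than min(M, N).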


Finally, if the element type is Integer, then BitSet could be more efficient than either List or HashSet, and it could use a couple of orders of magnitude less memory, depending on the range of the integers and the density of the sets.
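
For illustration, a union of two java.util.BitSet instances can be sketched with BitSet.or(). The class and method names below are hypothetical, and the sketch assumes the elements are non-negative ints, since BitSet indexes bits by non-negative position.

```java
import java.util.BitSet;

final class BitSetUnion {
    // Returns a new BitSet with every bit that is set in 'a' or in 'b'.
    // Assumption: elements are non-negative ints used directly as bit indices.
    static BitSet union(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone(); // leave the inputs untouched
        result.or(b);                       // word-by-word bitwise OR
        return result;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(1);
        a.set(5);
        BitSet b = new BitSet();
        b.set(5);
        b.set(900_000);
        System.out.println(union(a, b)); // prints {1, 5, 900000}
    }
}
```

Memory use is driven by the highest set index rather than by the number of elements, which is why the benefit depends on the range and density of the values.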


That's the Java analysis. I'm not familiar with Scala but the underlying computations and complexity will be the same.


 