Why is HashSet performance bad for large N?

I'm trying to solve the problem Nth Ugly Number . I use a HashSet to avoid adding duplicate elements to the PriorityQueue, expecting HashSet's add() and contains() to be O(1), which beats PriorityQueue's add() at O(log n). However, my implementation is consistently worse than the PriorityQueue-only solution.

Then I counted conflicts to measure the duplicate ratio. It stays slightly over 10%. So as N grows, the HashSet version should win (10% · log(n) ≫ 90% · C for large n). The weird thing is that as N grows, the HashSet version becomes even worse: from roughly equal performance at n = 1,000 / 10,000 / 100,000 to 3 times worse at 1,000,000 and 4 times worse at 10,000,000. I've read ( Fastest Java HashSet&lt;Integer&gt; library ), which recommends a 1.5n initial capacity; since the HashSet here ends up holding about 2.5~3n elements, I set its initial capacity to 4n or 5n. It doesn't help.

Does anyone know why this happens?

import java.util.HashSet;
import java.util.PriorityQueue;

public class Test {
  int conflict = 0;

  public static void main(String[] args) {
    Test test = new Test();
    long start = System.currentTimeMillis();
    int N = 10000000;
    test.nthUglyNumber(N);
    long end = System.currentTimeMillis();
    System.out.println("Time:" + (end - start));


    start = System.currentTimeMillis();
    test.nthUglyNumber2(N);
    end = System.currentTimeMillis();
    System.out.println("Time:" + (end - start));
  }

  public int nthUglyNumber(int n) {
    if (n <= 0) {
      return 0;
    }
    // Track every value ever generated so duplicates are never re-queued.
    // Use Long to match nthUglyNumber2; with plain int, cur * 2/3/5
    // overflows well before n = 10,000,000, which corrupts the comparison.
    HashSet<Long> CLOSED = new HashSet<Long>(5 * n);
    PriorityQueue<Long> OPEN = new PriorityQueue<Long>();
    long cur = 1;
    OPEN.add(cur);
    CLOSED.add(cur);
    while (n > 1) {
      n--;
      cur = OPEN.poll();
      long cur2 = cur * 2;
      if (CLOSED.add(cur2)) {
        OPEN.add(cur2);
      }
      // else { conflict++; }
      long cur3 = cur * 3;
      if (CLOSED.add(cur3)) {
        OPEN.add(cur3);
      }
      // else { conflict++; }

      long cur5 = cur * 5;
      if (CLOSED.add(cur5)) {
        OPEN.add(cur5);
      }
      // else { conflict++; }
    }
    return OPEN.peek().intValue();
  }

  public int nthUglyNumber2(int n) {
    if (n == 1)
      return 1;
    PriorityQueue<Long> q = new PriorityQueue<Long>();
    q.add(1L);

    for (long i = 1; i < n; i++) {
      long tmp = q.poll();
      // drain duplicate values from the head of the queue
      while (!q.isEmpty() && q.peek() == tmp)
        tmp = q.poll();

      q.add(tmp * 2);
      q.add(tmp * 3);
      q.add(tmp * 5);
    }
    return q.poll().intValue();
  }
}

I don't think that your analysis is taking account of memory management overheads. Each time the GC runs it is going to need to trace and move some or all of the reachable objects in the HashSet . While it is difficult to quantify this in the average case, in the worst case (a full GC) the extra work is O(N) .

There could also be secondary memory effects; e.g. the version with the HashSet will have a larger working set, which will lead to more memory cache misses. This will be most pronounced during garbage collection.

I suggest that you profile the two versions of the code to determine where the extra time is really being consumed.
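As a rough first check before reaching for a full profiler, you can compare wall-clock time against the JVM's accumulated GC time through the standard GarbageCollectorMXBean API. A minimal sketch, reusing the Test class from the question; the GcCheck class name is just for illustration, and a serious measurement would also warm up the JIT first:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {
  // Sum of accumulated collection time (ms) across all collectors;
  // getCollectionTime() may return -1 when a collector does not report it.
  static long totalGcMillis() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = gc.getCollectionTime();
      if (t > 0) {
        total += t;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    Test test = new Test();
    int n = 10000000;

    long gc0 = totalGcMillis();
    long t0 = System.currentTimeMillis();
    test.nthUglyNumber(n);
    System.out.println("HashSet version: " + (System.currentTimeMillis() - t0)
        + " ms total, " + (totalGcMillis() - gc0) + " ms in GC");

    gc0 = totalGcMillis();
    t0 = System.currentTimeMillis();
    test.nthUglyNumber2(n);
    System.out.println("PQ-only version: " + (System.currentTimeMillis() - t0)
        + " ms total, " + (totalGcMillis() - gc0) + " ms in GC");
  }
}

If the HashSet version's GC share grows much faster with n than the PQ-only version's, that points at allocation and collection cost rather than the algorithmic add()/poll() work.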


If you are looking for ways to make the cache perform better:

  • look for a specialized representation of the set; e.g. a BitSet or a third-party primitive-collection library.
  • consider using a LinkedHashSet and dropping entries once they have passed the window in which cache hits are possible (see the sketch after this list).
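LinkedHashSet itself has no eviction hook, but the same windowing effect can be built with Collections.newSetFromMap over a LinkedHashMap that overrides removeEldestEntry. A rough sketch; newWindowedSet and maxEntries are illustrative names, and if the window is sized too small, duplicates will slip back into the queue:

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class WindowedSet {
  // Insertion-ordered set that silently evicts its oldest entry once the
  // size exceeds maxEntries, keeping the working set (and hence cache
  // pressure) bounded instead of growing to ~3n entries.
  static Set<Long> newWindowedSet(final int maxEntries) {
    return Collections.newSetFromMap(
        new LinkedHashMap<Long, Boolean>(2 * maxEntries, 0.75f, false) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
            return size() > maxEntries;
          }
        });
  }
}

In this particular problem the window has a natural bound: a duplicate of a value v can only be generated while one of v/2, v/3, v/5 is still waiting in the queue, so once the queue minimum passes v/2, the entry for v can never be hit again and is safe to drop.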

Note that when there is no conflict (90% of the cases), you call add twice: once on the HashSet and once on the PriorityQueue , while the PriorityQueue -only solution calls add just once.

Therefore, your HashSet adds overhead in 90% of the cases while speeding up only 10% of them. Per polled value, the HashSet version pays three hash inserts (each O(1), but with real allocation and memory-traffic cost) to save only about 0.3 heap inserts' worth of O(log n) work, and that trade keeps getting worse as the growing HashSet drives up constant factors and GC pressure.
