简体   繁体   中英

Does HashSet do the sort job internal?

My Set is sometimes sorted, and sometimes not.

Here is the example:

public class SetOfInteger {
    public static void main(String[] args) {
        Random rand = new Random(47);
        Set<Integer> intset = new HashSet<>();
        for (int i = 0; i < 10; i++) {
            int j = rand.nextInt(30);
            System.out.print(j + " ");
            intset.add(j);
        }
        System.out.println();
        System.out.println(intset);
    }
}

The result shows that the set is not sorted.

8 5 13 11 1 29 28 20 12 7 
[1, 20, 5, 7, 8, 11, 12, 29, 28, 13]

When I change the termination expression to i < 20 in the for statement, the result shows that the set become sorted.

8 5 13 11 1 29 28 20 12 7 18 18 21 19 29 28 28 1 20 28 
[1, 5, 7, 8, 11, 12, 13, 19, 18, 21, 20, 29, 28]

It is so strange, is it? I just don't know how to explain it, and I need some help, Thank you very much.

A HashSet does not guarantee sorted iteration, but under very specific circumstances its internal data structure may act like a bucket sort .

Specifically, for integer keys in the range [0,65535] and a table size that is greater than the largest key, the index of the bucket a key is stored in is equal to the key itself, and since the iterator iterates in bucket order, it emits the elements in sorted order.

There are some good answers all around, but none attempt to explain what exactly happens in this particular situation, so I'll limit my answer to that, rather than add another explanation of how the HashSet works. I'm taking that understanding as granted.

The default constructor of HashSet creates a set with a capacity of 16 and a load factor of 0.75. That means there are 16 bins, and this capacity is increased when you insert 16 * 0.75 = 12 unique elements.

That's why in the first case, the numbers are sorted by their remainder when divided by 16: the set started with a table size of 16, "hashing" each element to a bin by taking x % 16 . Then when there were 12 elements, it grew the table and performed a rehash (see Javier Martin's answer if that's not clear), probably growing the table to 32. (I could only find information about how it grows in the java 6 doc , which states that the number of buckets is "approximately" doubled, whatever that means.) That gave each integer under 30 its own bin, so when the set iterated over each bin in order, it iterated over the numbers in order. If you inserted numbers below 64, you'd probably find that you need to insert 32*0.75 = 24 elements before the iteration appears sorted.

Also note that this way of assigning bins is not guaranteed behavior . HashSets in other Java versions/implementations might do something more complicated with the objects' hashCode() values than simply taking a remainder. (As noted by ruakh and fluffy in the comments - thanks!)

Your question points out that item order changes as the set grows bigger. However, you can't count on the order being preserved. A Set has one guarantee: there is only one of each kind of element . There are other Set objects that provide further guarantees, but a simple HashSet provides no guarantee of order.

The re-ordering you see is simply an internal reshuffling due to the way the HashSet is stored internally. In a very simplified way of thinking, the HashSet has a certain number of "slots" to store values which is usually an odd number if not also prime. The hashcodes from getHashCode() are used to assign the object to a slot. When you have a hash code collision, then the HashSet uses the equality operator equals() to determine if the objects are in fact unique.

As you add items to a HashSet several things happen:

  • Objects are assigned to their internal slot
    • The hashcode is then further hashed to find what slot it belongs in
    • If there's a slot collision, then we test for equality. If it's the same object we discard it, if not we add it to a list in that slot
  • When the number of objects exceed the number of slots, the HashSet needs to resize itself
    • It creates a bigger set of slots which is still usually an odd or prime number
    • The existing items are remapped into the new collection of slots -- this is where order can change

The bottom line is that if the objects magically sorted themselves, that's not an implementation you can count on unless you are using a TreeSet which imposes a sort order on the set items.

Interesting question. Set uses array of linked list to store its elements. hashCode() is used to find the position (indirectly) of the object to be stored in the Set .

In case there are two objects needing to be stored in same position then the object is stored in the next slot of the linked list at that position.

The size of the array is dynamic and computed run time as per the number of objects in it. Its not sure but I assume you see your numbers as sorted because the Set might have increased the size. The hashCode() is dependent upon the number value and so would have been computed sequentially. As the size of the underlying array would have increased with the increase of size of loop. There would have been no collisions and the output is sorted.

But Still I would like to emphasis so that my answer does not lead to any misconception. HashSet does not guarantee any ordering of the elements

The iteration order of a HashSet is not defined, the only guarantee is that it is consistent: iterating over a HashSet that has not been modified will produce the same sequences.

Internally, as a commenter said, the class uses the hashCode method of each element to store them in a certain number of bins . So, for example, if it's using 20 bins then it could take o.hashCode() % 20 as the bin index. Each bin can have several items in a list, which are then distinguished by the equals method. Thus, even if the hash of an Integer is its int value, the order need not be the natural integer ordering.

Furthermore, the set monitors its load factor when inserting and removing elements; considering the fraction of free bins, the maximum bin list size, the average number of items per bin, whatever. When it considers appropriate it performs a rehash, which means changing the number of bins that is used to store the elements, so their bin index changes because the n in the o.hashCode() % n changes. Every element is "reshuffled" to its new place (this is a costly operation), thus explaining the different ordering you see after adding more elements.

You must sort it manually, because there is no guarantee that the hashset will be sorted. If you want you can use also TreeSet which will provide the functionality you want, but if you want to use HashSet anyway try this:

Set intset = new HashSet();
List sortedIntList = new ArrayList(intset);
Collections.sort(sortedIntList);

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM