
Java: The best data structure to store a set of strings against rehashing char-by-char

Given a list of strings, I am wondering what the most efficient data structure is to verify whether a given string already exists and, if not, add it.
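For reference, the "check existence, add if absent" operation described above maps onto a single call on any Set implementation, since add returns false when the element was already present:

```java
import java.util.HashSet;
import java.util.Set;

public class CheckAndAdd {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        // Set.add performs the check and the insert in one call:
        // it returns false when the string was already present.
        System.out.println(seen.add("hello")); // true  (newly added)
        System.out.println(seen.add("hello")); // false (already present)
        System.out.println(seen.size());       // 1
    }
}
```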

My first thought was HashSet<String>, which has O(n) (best-case) complexity when checking for the existence of a string, where n is the length of the string being hashed. However, once the set's load factor is exceeded, each string will be rehashed char-by-char by the hashCode algorithm: s[0]*31^(n - 1) + s[1]*31^(n - 2) + ... + s[n - 1], which will lead to O(n^2).

EDIT: I would like to emphasise that my biggest concern here is the time complexity of rehashing each very long string whenever the HashSet rehashes, since the set holds a significant number of quite long, unique entries.

Is there a better way to store many strings in a HashSet, or maybe a better data structure for this use case?

Your first thought was wrong. The answer is HashSet. Period.

Your concern about collisions isn't relevant; trying to add an already existing string to the set doesn't do anything. The only way to get collisions is to add a lot of strings which, by pure coincidence, all have the same hashcode — which is not going to happen... unless someone is intentionally trying to mess with you, in order to run up the server bill or deny service to legitimate users.

If that's the case, you need to pay a considerable cost: you need a cryptographically secure hashing algorithm. You can certainly do so, but it makes things much more complicated. The model would be the same: have a hashmap with keys based on the result of custom_hash_algo("input"), and with a List<V> as value, holding every value that hashes to the same key under custom_hash_algo. Then you re-implement all of Set's methods with this (literally: make a class that extends AbstractSet<V>, where most methods are one-liners calling a method on that internal map).
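A minimal sketch of that design might look like the following (the class and field names are my own, and customHash stands in for custom_hash_algo; a full implementation would also override remove and the other mutating methods):

```java
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// A Set backed by a Map keyed on a pluggable hash function, with a
// List per bucket to hold values whose custom hashes collide.
public class CustomHashSet<V> extends AbstractSet<V> {
    private final Map<Integer, List<V>> buckets = new HashMap<>();
    private final ToIntFunction<V> customHash;   // stands in for custom_hash_algo
    private int size = 0;

    public CustomHashSet(ToIntFunction<V> customHash) {
        this.customHash = customHash;
    }

    @Override public boolean add(V v) {
        List<V> bucket =
            buckets.computeIfAbsent(customHash.applyAsInt(v), k -> new ArrayList<>());
        if (bucket.contains(v)) return false;    // already present
        bucket.add(v);
        size++;
        return true;
    }

    @SuppressWarnings("unchecked")
    @Override public boolean contains(Object o) {
        List<V> bucket = buckets.get(customHash.applyAsInt((V) o));
        return bucket != null && bucket.contains(o);
    }

    @Override public Iterator<V> iterator() {
        return buckets.values().stream().flatMap(List::stream).iterator();
    }

    @Override public int size() { return size; }
}
```

Because AbstractSet supplies the rest of the Set contract on top of iterator() and size(), most of the remaining methods really are one-liners.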

custom_hash_algo would then be whatever you need. If you want to protect against someone intentionally feeding you strings that hash-collide, then either have a simple blocking mechanism (if a list for a given custom hash value has too many entries, just crash and refuse service, as the odds are at this point literally 99.9999% that the customer is messing with you or is being messed with in turn), or a cryptographically secure hash.
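As one possible custom_hash_algo along those lines, the bucket key could be derived from SHA-256 instead of String.hashCode, so an attacker cannot cheaply precompute colliding inputs. This is only a sketch; a keyed hash such as HMAC would be stronger still, since a plain truncated SHA-256 is public knowledge:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SecureStringHash {
    // Derive an int bucket key from a cryptographic digest of the string.
    public static int customHashAlgo(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] d = md.digest(s.getBytes(StandardCharsets.UTF_8));
            // Fold the first four digest bytes into an int.
            return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                 | ((d[2] & 0xFF) << 8)  |  (d[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every conforming JVM.
            throw new AssertionError(e);
        }
    }
}
```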

If you have any other reason to believe that somehow hash collisions will be more frequent than the expected 'for any 2 given strings, the odds they collide is 1 in about 4 billion', the exact same principle can be used with some other, non-cryptographic algorithm as well (and therefore not robust against intentionally creating colliding strings).


NB: In case I misunderstood your question and your only concern is that the hashCode() impl of string will look at each character: No, you can't improve on that; it is not possible to hash strings such that collisions are mostly avoided without doing so, unless you know something specific about your particular strings that doesn't apply to arbitrary strings ('they always start with a unique 8-char ID,' - okay. then maybe you can use that for a hash).

This also doesn't make it O(n^2). When talking about algorithmic complexity, there is no way to do so without defining what n actually means. It's usually obvious from context, which is why it's usually not said, but it's still a crucial part of the statement, even if unsaid. In your case, '# of strings in the set' is one variable and 'average length of the strings' is another. At best you can say it is O(n*m) (with n = size of the collection and m = average length of a string in it). Which... yeah. That's obviously the most efficient way to do it.

NB2: An important optimization opportunity when writing your Map<Integer, List<V>>-backed set impl is to have the values not be List<V> but Object. Make a rule: if the stored object is not a list, it is the single value that hashes to that key; only when there is a collision do you create a special internal list type. (That is how you differentiate: nobody but you can make that internal list type, so if the value is of that type, you know it's the collision case.) That saves a LOT of overhead.
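A sketch of that NB2 optimization, under the same assumed design (names are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Store a bare value for the common no-collision case, and switch to a
// private list type only on a collision. Since nobody else can create
// CollisionList, an instanceof check safely distinguishes the two cases.
public class CompactBuckets<V> {
    // Private marker type: only this class can instantiate it.
    private static final class CollisionList<V> extends ArrayList<V> {}

    private final Map<Integer, Object> buckets = new HashMap<>();

    @SuppressWarnings("unchecked")
    public boolean add(int hash, V value) {
        Object cur = buckets.get(hash);
        if (cur == null) {                       // empty bucket: store the value directly
            buckets.put(hash, value);
            return true;
        }
        if (cur instanceof CollisionList) {      // already colliding: a real list
            CollisionList<V> list = (CollisionList<V>) cur;
            if (list.contains(value)) return false;
            return list.add(value);
        }
        if (cur.equals(value)) return false;     // single stored value, same element
        CollisionList<V> list = new CollisionList<>();  // first collision: promote
        list.add((V) cur);
        list.add(value);
        buckets.put(hash, list);
        return true;
    }
}
```

With a decent hash, almost every bucket stays in the "bare value" state, so the per-entry ArrayList allocation is avoided almost everywhere.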

Is that constructor not an option?

public HashSet(int initialCapacity, float loadFactor)
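If the number of entries is roughly known up front, that constructor lets you size the table so it never resizes. A small helper along those lines (the name is my own; note also that in OpenJDK both String and HashMap's internal nodes cache the hash value, so even a resize does not recompute hashes char-by-char):

```java
import java.util.HashSet;
import java.util.Set;

public class PresizedSet {
    // Build a HashSet large enough up front that it never resizes:
    // a resize is triggered once size exceeds capacity * loadFactor,
    // so capacity must be at least expectedSize / loadFactor.
    public static Set<String> forExpectedSize(int expectedSize) {
        float loadFactor = 0.75f;  // the HashSet default
        int capacity = (int) Math.ceil(expectedSize / (double) loadFactor);
        return new HashSet<>(capacity, loadFactor);
    }
}
```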

The more general case here is: how to implement a map keyed by arrays?

If the arrays (or Strings) are frequently very large, the hash code calculation becomes a negative speed factor. So a HashMap might not be a good idea.

Then a TreeMap would be better, as a comparison can short-circuit:

  1. a custom comparator can compare the lengths first, in O(1);
  2. only when the lengths are equal are the elements compared, and only up to the first mismatch.
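A comparator in that spirit might look as follows. Note that this yields a different ordering than natural String order, which is fine when the set is only used for membership tests (these names are illustrative):

```java
import java.util.Comparator;
import java.util.TreeSet;

public class LengthFirstTree {
    // Compare lengths first (O(1)); walk the characters only when the
    // lengths match. Consistent with equals, so it is safe for a TreeSet.
    public static final Comparator<String> LENGTH_FIRST =
        Comparator.<String>comparingInt(String::length)
                  .thenComparing(Comparator.<String>naturalOrder());

    public static TreeSet<String> newSet() {
        return new TreeSet<>(LENGTH_FIRST);
    }
}
```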

In other words:

  • Given N the number of arrays
  • Given L the average length of an array

Then for the average case:

  • HashMap: O(N * L)
  • TreeMap: approx. O(N * log(N) * C), where C is the cost of one comparison — at most L, but often far smaller when comparisons stop at the first mismatch

This means you have to benchmark.

This all becomes relevant when the Strings are large, say when they contain entire files. You might also "optimize" such things by compressing the data and storing the bytes together with a checksum (CRC?).
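One way that compression-plus-checksum idea could be sketched (the class name is my own; CRC32 here is a cheap integrity/equality filter, not a collision-proof hash, so the bytes are still compared on a checksum match):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

// Keep the deflated bytes plus a CRC32 checksum; equality checks the
// cheap checksum first and falls back to comparing the compressed bytes.
public class CompressedEntry {
    final byte[] compressed;
    final long crc;

    CompressedEntry(String s) {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        CRC32 c = new CRC32();
        c.update(raw);
        crc = c.getValue();
        Deflater d = new Deflater(Deflater.BEST_SPEED);
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            int n = d.deflate(buf);
            out.write(buf, 0, n);
        }
        d.end();
        compressed = out.toByteArray();
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof CompressedEntry)) return false;
        CompressedEntry e = (CompressedEntry) o;
        return crc == e.crc && Arrays.equals(compressed, e.compressed);
    }

    @Override public int hashCode() { return (int) crc; }
}
```

Entries like this can then go straight into a HashSet<CompressedEntry>, with the CRC doubling as a cheap hashCode.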

As one programs against interfaces (Set<String> set), the choice of implementation can be postponed.
