
Unique id for Java object

I am indexing Java objects into Elasticsearch. The class has the following structure:

public class Document {
    private String name;
    private double value;
    private Date date;
    private Map<String, String> attributes;
    //getters and setters
}

Before I index an object, I want to derive a unique id for it, based on the values of these members. If I construct another object with the same values for name, date, value and attributes (ie the number and the values of the key-value pairs are the same), then the ids should also be the same.

Currently, I am using Objects.hash(Object... objects) to calculate the hashCode and I set that hashCode as the id. It seems to work fine: it returns the same integer for objects that have the same values for these members. However, considering the number of documents and the range of int in Java, two different objects may end up with the same hashCode (in which case distinct documents would share one id).

Are there any alternative solutions? Can we create an alphanumeric string (or something similar) that depends on these values?

Thanks in advance.

You will not be able to avoid collisions completely unless you use the object itself as the key. If you wanted to do that, you could serialise your values into a sequence of bytes, ie 8 bytes for the double, 8 bytes for the date (because its internal representation is a long), and an arbitrary number of bytes depending on the length of your name.

The most sensible thing to do is to use these values to calculate a hashCode and then, when a collision occurs, compare each member one by one to confirm equality. This is how Java's Hashtable works.
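As a rough sketch of that approach, the class below pairs hashCode() with a field-by-field equals(), so a HashSet or HashMap can fall back to the comparison whenever two hash codes collide. The class name DocumentKey and its constructor are assumptions made for illustration, and the attributes map is included even though the byte-array example in this answer leaves it out.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical value class mirroring the fields of the question's Document.
final class DocumentKey {
    private final String name;
    private final double value;
    private final Date date;
    private final Map<String, String> attributes;

    DocumentKey(String name, double value, Date date, Map<String, String> attributes) {
        this.name = name;
        this.value = value;
        this.date = date;
        this.attributes = attributes;
    }

    @Override
    public int hashCode() {
        // 32-bit hash: fast, but different objects can share it
        return Objects.hash(name, value, date, attributes);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof DocumentKey)) return false;
        DocumentKey other = (DocumentKey) obj;
        // member-by-member comparison resolves any hash collision
        return Double.compare(value, other.value) == 0
                && Objects.equals(name, other.name)
                && Objects.equals(date, other.date)
                && Objects.equals(attributes, other.attributes);
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("colour", "red");
        Set<DocumentKey> seen = new HashSet<>();
        seen.add(new DocumentKey("doc", 1.5, new Date(1000L), attrs));
        // Same values, different instance: recognised as a duplicate
        boolean added = seen.add(new DocumentKey("doc", 1.5, new Date(1000L), new HashMap<>(attrs)));
        System.out.println("second copy added: " + added);
    }
}
```

This only helps while the objects are held in a Java collection. It does not by itself produce an id that is safe to store externally.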

If you want to go ahead and create your "definitely unique identifier" though...

// needs: java.nio.ByteBuffer, java.nio.charset.StandardCharsets, java.util.Base64
byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
// 8 bytes for the date + 8 bytes for the double + the bytes of the name
ByteBuffer buffer = ByteBuffer.allocate(16 + nameBytes.length);
buffer.putLong(date.getTime());
buffer.putLong(Double.doubleToLongBits(value));
buffer.put(nameBytes);
byte[] defoUnique = buffer.array();

/* Make byte sequence into alphanumeric string */
String identifierString = Base64.getEncoder().encodeToString(defoUnique);

You should override equals() AND hashCode(). (It is a common mistake not to override both together.)

Below is one example. The idea is to create a hashCode per object and then test for equality (whether you get your object back or not).

EXAMPLE:

    // from http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/builder/HashCodeBuilder.html
    import org.apache.commons.lang3.builder.EqualsBuilder;
    import org.apache.commons.lang3.builder.HashCodeBuilder;

    public class Person {
        String name;
        int age;
        boolean smoker;
        int id;  // this is your bit

        @Override
        public int hashCode() {
            // you pick two hard-coded, randomly chosen, non-zero, odd numbers,
            // ideally different for each class
            return new HashCodeBuilder(17, 37)
                .append(name)
                .append(age)
                .append(smoker)
                .toHashCode();
        }

        @Override
        public boolean equals(Object obj) {
            // the next 3 ifs are a short circuit
            if (obj == null) { return false; }
            if (obj == this) { return true; }
            if (obj.getClass() != getClass()) {
                return false;
            }

            // the meat of it: compare the same fields that hashCode() uses
            // (no appendSuper() here, because the superclass is Object)
            Person rhs = (Person) obj;

            boolean sameValues = new EqualsBuilder()
                .append(name, rhs.name)
                .append(age, rhs.age)
                .append(smoker, rhs.smoker)
                .isEquals();

            // here set/update your id
            if (sameValues) {
                this.id = rhs.id;
            }

            return sameValues;
        }
    }

Ended up having something like this:

    /**
     * Sets the id of document by calculating hash for individual elements
     */
    public void calculateHash(){
        ByteBuffer byteBuffer = ByteBuffer.allocate(16);
        byteBuffer.putInt(Objects.hashCode(name));
        byteBuffer.putInt(Objects.hashCode(date));
        byteBuffer.putInt(Objects.hashCode(value));
        byteBuffer.putInt(Objects.hashCode(attributes));
        super.setId(DigestUtils.sha512Hex(byteBuffer.array())); 
        byteBuffer.clear();
    }

So, basically, I calculate the hashes of the individual elements, put them into a byte array and then calculate the SHA-512 hash of that, so the chance of a collision is very small. Even if one hash collides, it is highly unlikely that the other hashes will collide too (the id is a combination of 4 hashes). An int hash can take about 4 billion values, so the probability that two objects agree on one of them is about 1/(4 billion), and the probability that they agree on all four is 1/4b x 1/4b x 1/4b x 1/4b, ie (1/4b)^4, if I am not wrong. That assumes the four hashes are independent and evenly distributed, and it is the chance for a single pair of documents, but it is more than good enough for me :)

I don't know whether it is the most appropriate (or even an appropriate) way, but it seems to have worked.

thanks

hashCode() gives only 32 bits. If that risks collisions, use a different hashing algorithm.

java.security.MessageDigest provides several options in Java.

I would recommend "MD5" for this, which gives you a 128-bit number.

"MD5" = 128 bits
"SHA-1" = 160 bits
"SHA-256" = 256 bits
"SHA-384" = 384 bits
"SHA-512" = 512 bits

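You don't have to worry about the cryptographic weaknesses of MD5 or SHA-1 here, because the id only has to avoid accidental collisions, not withstand deliberate attacks.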
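A minimal sketch of how MessageDigest could be applied to the question's fields follows. The class name DigestIdSketch and the method digestId are made up for illustration, the fields are assumed to be non-null, and "SHA-256" is used only as an example (any of the algorithm names listed above can be substituted). Each string is written with a length prefix and the attributes are sorted by key, so that equal field values always produce the same bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Date;
import java.util.Map;
import java.util.TreeMap;

final class DigestIdSketch {

    // Hypothetical helper: hex digest over a canonical byte encoding of the fields.
    // Assumes no argument (and no map key or value) is null.
    static String digestId(String name, double value, Date date, Map<String, String> attributes) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            writeString(out, name);
            out.writeLong(Double.doubleToLongBits(value));
            out.writeLong(date.getTime());
            // TreeMap orders entries by key, so insertion order does not matter
            Map<String, String> sorted = new TreeMap<>(attributes);
            out.writeInt(sorted.size());
            for (Map.Entry<String, String> entry : sorted.entrySet()) {
                writeString(out, entry.getKey());
                writeString(out, entry.getValue());
            }
            out.flush();
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return toHex(digest.digest(bytes.toByteArray()));
        } catch (IOException | NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Length prefix keeps ("ab", "c") distinct from ("a", "bc")
    private static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    private static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Hashing the field values themselves, rather than their 32-bit hashCode() results, keeps the full width of the digest meaningful.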

Trade off the size of the hash against the chance of collision.
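To put rough numbers on that trade-off, the sketch below uses the standard birthday approximation: for n hashes of b bits each, the chance of at least one collision is about 1 - exp(-n(n-1) / 2^(b+1)). It assumes the hashes are uniformly distributed, which real hashCode() values often are not, so treat the results as optimistic estimates. The class and method names are illustrative only.

```java
final class CollisionEstimate {

    // Birthday approximation for n uniformly random hashes of the given bit width
    static double collisionProbability(long n, int bits) {
        double pairs = (double) n * (n - 1) / 2.0;   // number of distinct pairs
        double space = Math.pow(2.0, bits);          // number of possible hash values
        // -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
        return -Math.expm1(-pairs / space);
    }

    public static void main(String[] args) {
        long documents = 1_000_000L;
        for (int bits : new int[] {32, 128, 160, 256}) {
            System.out.printf("%3d bits, %d documents: %.3e%n",
                    bits, documents, collisionProbability(documents, bits));
        }
    }
}
```

With a 32-bit hash the estimate reaches about 50% at roughly 77,000 documents, while at 128 bits and above it stays negligible for any realistic number of documents.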

There is always a risk of collision with a hash. To avoid it fully, concatenate the elements into a single string. Represent numbers in base 16, 32 or 64 to save a bit of space.
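One way the concatenation could look is sketched below. Plain concatenation is ambiguous ("ab" + "c" and "a" + "bc" give the same string), so this version prefixes every field with its length. The names ConcatenatedIdSketch and concatenatedId are hypothetical, the numbers are written in base 16, the fields are assumed to be non-null, and the attributes are sorted by key so their insertion order does not affect the result.

```java
import java.util.Date;
import java.util.Map;
import java.util.TreeMap;

final class ConcatenatedIdSketch {

    // Hypothetical helper: collision-free id built from the field values themselves.
    // Assumes no argument (and no map key or value) is null.
    static String concatenatedId(String name, double value, Date date, Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder();
        appendField(sb, name);
        appendField(sb, Long.toHexString(Double.doubleToLongBits(value)));
        appendField(sb, Long.toHexString(date.getTime()));
        // Sorted copy so that equal maps always serialise identically
        for (Map.Entry<String, String> entry : new TreeMap<>(attributes).entrySet()) {
            appendField(sb, entry.getKey());
            appendField(sb, entry.getValue());
        }
        return sb.toString();
    }

    // Writes "<length>:<field>" so field boundaries can never be confused
    private static void appendField(StringBuilder sb, String field) {
        sb.append(field.length()).append(':').append(field);
    }
}
```

The resulting id grows with the size of the data, which is the price of avoiding collisions entirely. If that is too long, the string can be fed into one of the digests above.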
