
Comparing whether two lists of strings are equal using hashCode?

I am writing a Java/JEE client-server application. I have a requirement where the files present on the server should match the files present on the client. I am only trying to validate that there is an exact match in the file names and the number of files in a specific directory.

Example of what is required:

Server
   DirectoryA
        FileA 
        FileB
        FileC

Client
   DirectoryA
       FileA
       FileB
       FileC

What would be the most efficient way for the server to make sure that all clients have the same files, assuming I can have over 100 clients and that I do not want my client/server communication to be too chatty?

My current approach uses a REST API and a REST client:

Server:

  1. Find list of files in the target directory
  2. Create a checksum for the directory from the hash codes of the file names, summing them up with the number 31.

Clients:

  1. Upon receiving a request to verify the integrity of the target directory, the client takes the checksum provided by the server and runs the same algorithm to generate a checksum for its local directory.
  2. If the checksums match, the client responds to the server with success.

Is this approach correct?

Is this approach correct?

The approach is correct, but the proposed implementation is not (IMO).

I assume that "summing with 31" means something like this

  int hash = 0;
  for (String name : names) {
      // fold each file name's 32-bit hash code into the running checksum
      hash = hash * 31 + name.hashCode();
  }

Java hashCode values are 32-bit quantities. If we assume that the hash values are distributed uniformly, there is a chance of 1 in 2^32 that two different sets of filenames will have the same hash (as calculated above). In other words, a "hash collision".

An algorithm that gets it wrong one time in 4 billion is probably not acceptable. Worse still, if the algorithm is known, then someone can trivially manufacture a situation (i.e. a set of filenames) where the algorithm gives the wrong answer.
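To make the "trivially manufacture" point concrete, here is a small sketch. It relies on the well-known fact that "Aa" and "BB" have the same String.hashCode() in Java, so any two names that differ only in that prefix collide as well. The class name HashCollisionDemo, the checksum helper, and the file names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class HashCollisionDemo {

    // The 31-based checksum from the snippet above, applied to a list of names.
    static int checksum(List<String> names) {
        int hash = 0;
        for (String name : names) {
            hash = hash * 31 + name.hashCode();
        }
        return hash;
    }

    public static void main(String[] args) {
        // Hypothetical directory listings that differ in one file name.
        List<String> server = Arrays.asList("Aa.txt", "FileB");
        List<String> client = Arrays.asList("BB.txt", "FileB");

        System.out.println(server.equals(client));                // false
        System.out.println(checksum(server) == checksum(client)); // true
    }
}
```

Note also that this checksum depends on the order of the names, so both sides would have to sort their listings the same way before hashing.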

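If you want to avoid these problems, you need longer checksums. If you want to protect against people manufacturing collisions, then you need to use a cryptographically strong hash / checksum. SHA-256 is a common choice. MD5 is popular too, but collisions can be constructed for it deliberately, so it only guards against accidental mismatches.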
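A longer, cryptographic checksum over the file names could look roughly like the sketch below. It assumes SHA-256 via the standard java.security.MessageDigest API. The class name DirectoryDigest, the digestOf helper, and the choice of a zero byte as separator are illustrative assumptions, not part of the original question.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class DirectoryDigest {

    // Hypothetical helper: hex-encoded SHA-256 over the sorted file names.
    static String digestOf(List<String> fileNames) {
        List<String> sorted = new ArrayList<>(fileNames);
        Collections.sort(sorted); // listing order may differ between machines
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String name : sorted) {
                md.update(name.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator, so ["ab", "c"] != ["a", "bc"]
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Java platforms are required to provide SHA-256
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String server = digestOf(Arrays.asList("FileA", "FileB", "FileC"));
        String client = digestOf(Arrays.asList("FileC", "FileA", "FileB"));
        System.out.println(server);
        System.out.println(server.equals(client)); // true: same names, different order
    }
}
```

The server would send the hex string once per request, and each client would compare it with its own value, which keeps the exchange as small as the hashCode-based version.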

But if it were me, I would also consider just sending a complete list of filenames ... or using the (cheap) hashcode-based checksum as just a hint that the directory contents could be the same. (Whether the latter makes sense depends on what you need to do next.)
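If the complete list of filenames is sent, the receiving side can report exactly which files differ instead of a bare yes/no. A minimal sketch, assuming both listings are already available as collections of names (DirectoryDiff and missingFrom are hypothetical names):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DirectoryDiff {

    // Names present in 'expected' but absent from 'actual', in sorted order.
    static Set<String> missingFrom(Collection<String> expected, Collection<String> actual) {
        Set<String> missing = new TreeSet<>(expected);
        missing.removeAll(actual);
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical listings: the client lacks FileC and has an extra FileD.
        List<String> server = Arrays.asList("FileA", "FileB", "FileC");
        List<String> client = Arrays.asList("FileA", "FileB", "FileD");

        System.out.println("Missing on client: " + missingFrom(server, client)); // [FileC]
        System.out.println("Extra on client: " + missingFrom(client, server));   // [FileD]
    }
}
```

The directories match exactly when both difference sets are empty.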
