
Use hash codes to compare two large strings in Java?

I am reading from two files, and some lines appear in both. I need to write a function that detects which lines are found in both files. My current code reads file 1 and puts its lines in an ArrayList, then reads file 2 and checks, for each line, whether it is in the ArrayList; if it is, I know it is a duplicate line. My problem is that I am storing full lines in the ArrayList. Would it be possible to convert each line I read into a hash code, store that in the ArrayList instead, and then compare it against the hash code of each line I read from file 2? Is this a better approach for saving memory?

If the two hashcodes are different, the lines are different. If the two hashcodes are the same, the lines may or may not be the same.
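As a small illustration of that second case, here is a sketch with two short strings that happen to share a hash code in Java. The class and method names are made up for the example.

```java
// Hypothetical demo class: shows two different strings with equal hash codes.
class HashCollisionDemo {

    // True when the hash codes match although the strings differ.
    static boolean sameHashDifferentContent(String a, String b) {
        return a.hashCode() == b.hashCode() && !a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
        System.out.println(sameHashDifferentContent("Aa", "BB")); // true
    }
}
```

So an equal hash code only marks a candidate match; the lines themselves still have to be compared to confirm it.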

If you store the lines of the first file in a HashSet, checking whether a line already exists is a very fast operation. HashSet uses the hash code internally.

This approach will save memory, but it won't guarantee a match: hash codes are not guaranteed to be unique, so two different lines can share one. If you want to store a smaller version of the string, store a digest of the string, such as MD5.

Here's how you get the digest.

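import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
...
// getInstance throws NoSuchAlgorithmException if no provider offers MD5
MessageDigest md = MessageDigest.getInstance("MD5");
// Name the charset explicitly so the digest does not depend on the platform default
byte[] digestBytes = md.digest(string.getBytes(StandardCharsets.UTF_8));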

An MD5 digest is 16 bytes long, so this will only save you memory if your strings are significantly longer than 8 characters (at 2 bytes per character).
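One way the digest idea might be put to use is sketched below. A raw byte[] compares by identity rather than by content, so this sketch wraps each digest in a ByteBuffer, which does compare by content, before putting it in a HashSet. The class and method names are hypothetical, and the Iterable parameters stand in for however the lines are actually read.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: keeps 16-byte MD5 digests of file 1 instead of full lines.
class DigestLineMatcher {

    // Wraps the digest in a ByteBuffer because byte[] has no content-based equals/hashCode.
    static ByteBuffer md5(String line) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return ByteBuffer.wrap(md.digest(line.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available on this JVM", e);
        }
    }

    // Returns the lines of file 2 whose digest also occurs among the lines of file 1.
    static List<String> probableDuplicates(Iterable<String> file1Lines, Iterable<String> file2Lines) {
        Set<ByteBuffer> digests = new HashSet<>();
        for (String line : file1Lines) {
            digests.add(md5(line));
        }
        List<String> matches = new ArrayList<>();
        for (String line : file2Lines) {
            if (digests.contains(md5(line))) {
                matches.add(line);
            }
        }
        return matches;
    }
}
```

Each ByteBuffer and backing array carries its own object overhead, so the real saving per line is smaller than the raw 16 bytes suggests.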

But unless your files are extremely large, you really don't need to worry about memory and the HashSet answers will give you better results.

Edit:

MD5 can produce collisions, but accidental collisions are extremely unlikely under real-world conditions. It should not be used as a cryptographic hash, but it would work fine in this circumstance. Other digest functions, such as SHA-256, have a lower chance of collision, but their digests are larger.

You are looking for a HashSet<String> - it will perfectly fit your needs!


Example:

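Set<String> file1  = ...; // lines read one by one from file1, stored in a HashSet
List<String> file2 = ...; // lines read one by one from file2

// Iterate over file2 and probe the set: contains() on a HashSet is a constant-time lookup
for (String line : file2) {
    if (file1.contains(line)) {
        System.out.println("Duplicate found: " + line);
    }
}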
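A fuller sketch of the HashSet approach, with the file reading filled in, could look like the following. The class name, the method name, and the use of command-line arguments for the two paths are assumptions made for the example, not part of the answer above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: finds the lines that occur in both inputs.
class CommonLines {

    // Loads every line of the first input into a HashSet, then streams the second input past it.
    static Set<String> commonLines(BufferedReader file1, BufferedReader file2) {
        Set<String> seen = file1.lines().collect(Collectors.toCollection(HashSet::new));
        // LinkedHashSet keeps the order in which matches appear in the second input.
        Set<String> common = new LinkedHashSet<>();
        file2.lines().filter(seen::contains).forEach(common::add);
        return common;
    }

    public static void main(String[] args) throws IOException {
        // args[0] and args[1] are assumed to be the paths of the two files.
        try (BufferedReader a = Files.newBufferedReader(Paths.get(args[0]));
             BufferedReader b = Files.newBufferedReader(Paths.get(args[1]))) {
            commonLines(a, b).forEach(System.out::println);
        }
    }
}
```

Only the first file is held in memory; the second is read one line at a time.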

If you are really worried about memory and are willing to accept poorer performance in order to save memory, you could do the following:

  1. Create a HashSet of the hash values for file 1.
  2. Create a HashSet of the hash values from file 2 that match a hash value from file 1.
  3. Create a HashSet of the lines from file 1 whose hash values are in HashSet 2.
  4. Check each line from file 2 against the HashSet 3.
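The four steps above can be sketched roughly as follows. The class and method names are hypothetical, and each Iterable is assumed to re-read its file every time it is iterated, since every file is scanned twice; the sketch uses String.hashCode() as the hash value.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: trades extra passes over the files for a smaller memory footprint.
class TwoPassHashFilter {

    static Set<String> commonLines(Iterable<String> file1, Iterable<String> file2) {
        // Step 1: hash values of every line in file 1.
        Set<Integer> hashes1 = new HashSet<>();
        for (String line : file1) {
            hashes1.add(line.hashCode());
        }

        // Step 2: hash values from file 2 that also occur in file 1.
        Set<Integer> sharedHashes = new HashSet<>();
        for (String line : file2) {
            int h = line.hashCode();
            if (hashes1.contains(h)) {
                sharedHashes.add(h);
            }
        }
        hashes1 = null; // no longer needed; lets the garbage collector reclaim it

        // Step 3: only the file 1 lines whose hash survived step 2 are kept as strings.
        Set<String> candidates = new HashSet<>();
        for (String line : file1) {
            if (sharedHashes.contains(line.hashCode())) {
                candidates.add(line);
            }
        }

        // Step 4: the full string comparison removes hash collisions.
        Set<String> common = new LinkedHashSet<>();
        for (String line : file2) {
            if (candidates.contains(line)) {
                common.add(line);
            }
        }
        return common;
    }
}
```

Full strings are stored only for lines whose hash value appears in both files, which is where the memory saving comes from when the overlap is small.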

You didn't mention a size limit on the files, so I'm assuming that they could be large enough to make it impossible to store all of the lines in memory.

So, I'd suggest the following approach:

  1. Concatenate the two files to create one large file.

  2. Use an "external" sorting algorithm, for example, http://code.google.com/p/externalsortinginjava/ to sort the large file.

  3. Read the sorted file one line at a time, and compare each line to the line before it (only ever keeping two lines in memory: the current and the previous line). If the current line and the previous line are the same, then the line occurs in both original files, provided neither file contains the same line more than once.
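Step 3 could be sketched as below, assuming the concatenated file has already been sorted by some external tool and that no line is repeated within either original file. The class name, method name, and callback parameter are hypothetical.

```java
import java.io.BufferedReader;
import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical helper: scans already-sorted input and reports lines that repeat.
class SortedDuplicateScanner {

    // Calls onDuplicate once for every distinct line that appears on adjacent rows.
    static void reportRepeatedLines(BufferedReader sorted, Consumer<String> onDuplicate) {
        String previous = null;     // the line read just before the current one
        String lastReported = null; // avoids reporting a line twice if it occurs three or more times
        Iterator<String> lines = sorted.lines().iterator();
        while (lines.hasNext()) {
            String current = lines.next();
            if (current.equals(previous) && !current.equals(lastReported)) {
                onDuplicate.accept(current);
                lastReported = current;
            }
            previous = current;
        }
    }
}
```

A call such as `SortedDuplicateScanner.reportRepeatedLines(reader, System.out::println)` would print each shared line once while holding only a few lines in memory at a time.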

The "external sort" was frequently necessary in the earlier days of computing, when much less memory was available. One way of doing it was (and is) the merge sort, which, when used with tapes (remember tapes?), was known as a "tape sort". Yes, I am old :-)

If you're concerned about space/memory issues, convert the strings to base36 before storing them in the HashSet that multiple people have already suggested. To standardize things, I suggest stripping all the whitespace and punctuation from the string and converting it to lower case before creating the base36 equivalent. You then end up with a HashSet<String> in which each String holds the base36 encoding of a line instead of the line itself. Note that normalizing this way treats lines that differ only in whitespace, punctuation, or case as equal.
