
How can I find out whether two strings are similar to each other in Java?

I am looking for a way to compare strings so that s1 and s2 in the following example are treated as the same.

String s1 = "John: would you please one the door";
String s2 = "John: would you please one the door  ????";

What should I do?

The notion of similarity between strings is described by a string metric. A basic example of a string metric is the Levenshtein distance (often referred to as edit distance).

Wikibooks offers a Java implementation of this algorithm: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Java
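For illustration, here is a minimal sketch of the Levenshtein distance using the standard two-row dynamic-programming formulation. It is not the Wikibooks code; the class name `LevenshteinSketch` and the `similarity` helper (which scales the distance into the range 0 to 1) are assumptions made for this example.

```java
class LevenshteinSketch {

    // Number of single-character insertions, deletions, and substitutions
    // needed to turn a into b. Only two rows of the DP table are kept.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are identical
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        String s1 = "John: would you please one the door";
        String s2 = "John: would you please one the door  ????";
        System.out.println("distance   = " + distance(s1, s2));
        System.out.println("similarity = " + similarity(s1, s2));
    }
}
```

For the two strings in the question the distance is 6 (the two spaces and four question marks appended to s2). What counts as "similar" is a threshold you would have to choose for your own data.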

I'm not aware of any good techniques, but getting rid of multiple spaces and punctuation might be a start.

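String s1 = "John: would you please one the door";
String s2 = "John: would you please one the door  ????";

// Collapse runs of spaces, strip common punctuation, then trim the ends.
s1 = s1.replaceAll(" {2,}", " ").replaceAll("[.?!/\\()]", "").trim();
s2 = s2.replaceAll(" {2,}", " ").replaceAll("[.?!/\\()]", "").trim();

if (s1.equalsIgnoreCase(s2))
{
    // the normalized strings match, ignoring case
}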

Demo that works on your strings: http://ideone.com/FSHOJt

"Similar" implies that there are commonalities, and measuring them is a nontrivial problem. What you are really asking for is a relevance score and faceted search. This is typically done by tokenizing a string into its base words and checking for the presence of common base words in the result. As a concrete example, take the sentence:

"The shadowy figure fell upon them."

You can break this down into facets:

shadow
figure
fell

Each of these can be evaluated with synonyms:

shadow -> dark, shade, silhouette, etc.
figure -> statistic, number, quantity, amount, level, total, sum, silhouette, outline, shape, form, etc.
fell -> cut down, chop down, hack down, saw down, knock down/over, knock to the ground, strike down, bring down, bring to the ground, prostrate, etc.

Then the same is done to the string being compared, and the common facets are counted. The more facets the two strings share, the higher the relevance of the match.

There are lots of fairly heavyweight tools like Lucene and Solr in the open source community that tackle this problem, but you may be able to do a simple version by breaking the string into tokens and simply looking for common tokens. A simple example:

import java.util.HashMap;

public class TokenExample {

    public static HashMap<String, Integer> tokenizeString(String s)
    {
        // split s on whitespace and count how often each token occurs
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        for (String token : s.split("\\s+"))
        {
            // normalize the token
            token = token.toLowerCase();
            if ( map.containsKey(token) )
            {
                map.put(token, map.get(token)+1);
            }
            else
            {
                map.put(token, 1);
            }
        }
        return map;
    }

    public static Integer getCommonalityCount(String s1, String s2)
    {
        HashMap<String, Integer> map1 = tokenizeString(s1);
        HashMap<String, Integer> map2 = tokenizeString(s2);

        Integer commonIndex = 0;
        for (String token : map1.keySet())
        {
            if ( map2.containsKey(token))
            {
                commonIndex += 1;
                // you could instead count for how often they match like this
                // commonIndex += map2.get(token) + map1.get(token);
            }
        }
        return commonIndex;
    }

    public static void main(String[] args) {
        String s1 = "John: would you please one the door";
        String s2= "John: would you please one the door  ????";

        String s3 = "John: get to the door and open it please ????";
        String s4= "John: would you please one the door  ????";

        System.out.println("Commonality index: " + getCommonalityCount(s1, s2));
        System.out.println("Commonality index: " + getCommonalityCount(s3, s4));
    }
}

There are various approaches to this problem. An easy one is to use the Levenshtein distance. Another approach is cosine similarity. If you need more details, please comment.
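As a rough sketch of the cosine similarity approach, each string can be turned into a vector of token counts and the cosine of the angle between the two vectors computed. The class name `CosineSketch`, the whitespace tokenization, and the lower-casing are assumptions made for this example, not part of the answer above.

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {

    // Count how often each lower-cased, whitespace-separated token occurs.
    static Map<String, Integer> termFrequencies(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : s.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a token-count vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between the two token-count vectors:
    // close to 1.0 for the same token distribution, 0.0 for no shared tokens.
    static double cosine(String a, String b) {
        Map<String, Integer> va = termFrequencies(a);
        Map<String, Integer> vb = termFrequencies(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : va.entrySet()) {
            Integer other = vb.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double denominator = norm(va) * norm(vb);
        return denominator == 0.0 ? 0.0 : dot / denominator; // empty input yields 0.0
    }

    public static void main(String[] args) {
        String s1 = "John: would you please one the door";
        String s2 = "John: would you please one the door  ????";
        System.out.println("cosine similarity = " + cosine(s1, s2));
    }
}
```

Unlike the Levenshtein distance, this measure ignores word order, so it suits cases where the same words appear in a different arrangement. Stripping punctuation before tokenizing would bring the score for the two example strings even closer to 1.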


 