How to match 2 strings by X% (e.g. >90% matching)

For example:

S1: "some filename contains few words.txt"
S2: "some filename contains few words - draft.txt"
S3: "some filename contains few words - another draft.txt"
S4: "some filename not contains few words.txt"

Note that I may get S2 or S3 as the first string, and the others should still match it.

EDITED: I have the "master" string, and I need to find matches.

Let's say that in the first round I already found the typos.

Now I have to match only whole words.

I want to be able to decide that 5 out of 7 words match, or 7 out of 10. The exact number of "X out of Y" is less important.

The important thing is how to find that the difference is X words, no matter where they are in the sentence.

Thanks

This isn't a regex problem.

You don't specify a language, but if you're using Java, there's the getLevenshteinDistance method of StringUtils (Apache Commons Lang). From the Javadocs:

Find the Levenshtein distance between two Strings.

This is the number of changes needed to change one String into another, where each change is a single character modification (deletion, insertion or substitution).

Usage:

int distance = StringUtils.getLevenshteinDistance(
    "some filename contains few words.txt",
    "some filename not contains few words.txt"
);

To match by some percentage, you have to decide which string is the "master", since the input strings can have different lengths. Note that the distance might consist entirely of insertions or deletions, so "cat" and "cataract" have a distance of 5. Defining what a "90% match" should be is also a bit difficult. Look at our cat example: 100% of the string "cat" is found in "cataract", but they're not the same string. You'll have to decide these rules depending on your use case.
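One possible way to turn the distance into a percentage is to divide it by the length of the longer string. The sketch below is only an illustration under that assumption: the class LevenshteinSimilarity and its methods are hypothetical names, and the distance is computed with a plain dynamic-programming loop so that the example runs without any library.

```java
class LevenshteinSimilarity {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumption: similarity = 1 - distance / length of the longer string.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        // "cat" vs "cataract": distance 5, longer length 8, similarity 0.375
        System.out.println(similarity("cat", "cataract"));
    }
}
```

With this definition "cat" and "cataract" score 0.375 even though one contains the other, which is why the choice of normalization has to follow your use case.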

Update:

If your "difference" should be word-based, it'd be relatively easy to split the string on word boundaries and construct a Map from the resultant word to the count for each word. Comparing the generated maps for each string should then give you a rough "similarity" measurement. For example:

import java.util.HashMap;
import java.util.Map;

// Count how many times each whitespace-separated word occurs in str.
public HashMap<String, Integer> countWords(String str) {
    HashMap<String, Integer> counts = new HashMap<String, Integer>();
    for(String s : str.split("\\s+")) {
        if(!s.isEmpty()) {
            if(counts.containsKey(s)) {
                counts.put(s, counts.get(s) + 1);
            } else {
                counts.put(s, 1);
            }
        }
    }
    return counts;
}

// ...

String s1 = "some filename contains few words.txt";
String s2 = "some filename not contains few words.txt";
HashMap<String, Integer> s1Counts = countWords(s1);
HashMap<String, Integer> s2Counts = countWords(s2);
// assume s1 is "master" string, count the total number of words
int s1Total = 0, s2Total = 0;
for(Integer i : s1Counts.values()) {
    s1Total += i;
}
// iterate over words in s1, find the number of matching words in s2
for(Map.Entry<String, Integer> entry : s1Counts.entrySet()) {
    if(s2Counts.containsKey(entry.getKey())) {
        if(s2Counts.get(entry.getKey()) >= entry.getValue()) {
            s2Total += entry.getValue();
        } else {
            s2Total += s2Counts.get(entry.getKey());
        }
    }
}
// result
System.out.println(s2Total + " out of " + s1Total + " words match.");
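To decide whether "X out of Y" is good enough, the same counting can be wrapped in a small helper that compares the ratio to a threshold. This is a minimal sketch, not a definitive implementation: the class WordOverlap, its method names, and the minRatio parameter are made up for illustration, and words are still split on whitespace only (so "words.txt" counts as one word).

```java
import java.util.HashMap;
import java.util.Map;

class WordOverlap {

    // Count occurrences of each whitespace-separated word.
    static Map<String, Integer> countWords(String str) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : str.split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Total number of words in str, counting repeats.
    static int totalWords(String str) {
        int total = 0;
        for (int count : countWords(str).values()) {
            total += count;
        }
        return total;
    }

    // Number of words of master that also occur in candidate, regardless of
    // position; repeated words are capped by how often candidate has them.
    static int matchingWords(String master, String candidate) {
        Map<String, Integer> masterCounts = countWords(master);
        Map<String, Integer> candidateCounts = countWords(candidate);
        int matched = 0;
        for (Map.Entry<String, Integer> entry : masterCounts.entrySet()) {
            matched += Math.min(entry.getValue(),
                    candidateCounts.getOrDefault(entry.getKey(), 0));
        }
        return matched;
    }

    // True if at least minRatio (e.g. 0.9) of master's words are found in candidate.
    static boolean isMatch(String master, String candidate, double minRatio) {
        int total = totalWords(master);
        return total > 0
                && (double) matchingWords(master, candidate) / total >= minRatio;
    }
}
```

Note that the result depends on which string is the master: with "some filename contains few words.txt" as master, all 5 of its words are found in "some filename not contains few words.txt", but in the other direction only 5 out of 6 match.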

I think it is worth taking a look at the Apache commons-text class JaroWinklerDistance. From its Javadoc:

Find the Jaro Winkler Distance which indicates the similarity score between two CharSequences.
 distance.apply(null, null)          = IllegalArgumentException
 distance.apply("","")               = 0.0
 distance.apply("","a")              = 0.0
 distance.apply("aaapppp", "")       = 0.0
 distance.apply("frog", "fog")       = 0.93
 distance.apply("fly", "ant")        = 0.0
 distance.apply("elephant", "hippo") = 0.44
 distance.apply("hippo", "elephant") = 0.44
 distance.apply("hippo", "zzzzzzzz") = 0.0
 distance.apply("hello", "hallo")    = 0.88
 distance.apply("ABC Corporation", "ABC Corp") = 0.93
 distance.apply("D N H Enterprises Inc", "D & H Enterprises, Inc.") = 0.95
 distance.apply("My Gym Children's Fitness Center", "My Gym. Childrens Fitness") = 0.92
 distance.apply("PENNSYLVANIA", "PENNCISYLVNIA")    = 0.88
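To make it clearer where scores like these come from, here is a from-scratch sketch of the textbook Jaro-Winkler formula. It is not the commons-text implementation: the class JaroWinklerSketch and its methods are hypothetical names, and the constants (prefix weight 0.1, prefix capped at 4 characters, bonus applied only when the Jaro score exceeds 0.7) are the conventional choices, assumed here rather than taken from the library.

```java
class JaroWinklerSketch {

    // Jaro similarity: based on matching characters within a sliding window
    // and on how many of those matches are out of order (transpositions).
    static double jaro(String a, String b) {
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk both strings over matched characters only and count mismatched pairs.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) {
                continue;
            }
            while (!bMatched[k]) {
                k++;
            }
            if (a.charAt(i) != b.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        return (m / a.length() + m / b.length() + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for strings sharing a common prefix.
    static double similarity(String a, String b) {
        double j = jaro(a, b);
        if (j <= 0.7) {
            return j; // assumed boost threshold
        }
        int maxPrefix = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < maxPrefix && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        // About 0.925, which rounds to the 0.93 listed for "frog" / "fog" above.
        System.out.println(similarity("frog", "fog"));
    }
}
```

A ">90% match" rule could then be expressed as similarity(master, candidate) >= 0.9, keeping in mind that this is a character-level measure, not a word-level one.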
