I'd like to know if there is any class in Java that can measure, using its own criteria, how similar one String is to another. For example:
William Shakespeare / William Shakespeare : might be 100%
William Shakespe**a**re / William Shakespe**e**re : might have above 90%
William Shakespeare / Shakespeare, William : might have above 70% (just examples)
I see two main candidates:
You have to use a "soft" string metric:
There are many others, see String Metrics for an overview.
The best algorithm depends heavily on the problem domain. For example, SoundEx degrades for Eastern European names, and the Hamming distance does not help much if you want to compare the similarity of "real world" words.
Generally, there is the Levenshtein algorithm, which outputs how many character-level insert, substitute, or delete operations you would have to perform to transform one string into another. Apache's StringUtils class has an implementation.
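To illustrate how an edit distance can be turned into a percentage like the ones in the question, here is a minimal sketch in plain Java. The class name LevenshteinSimilarity and the normalization (dividing by the longer string's length) are my own assumptions, not the Apache implementation.

```java
final class LevenshteinSimilarity {

    // Dynamic-programming edit distance that keeps only two rows of the table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 for identical strings, 0.0 when every character differs.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / longest;
    }
}
```

With this normalization, "William Shakespeare" against "William Shakespeere" differs by one substitution over 19 characters, so it lands a little under 95%. A reordered name such as "Shakespeare, William" scores poorly, because edit distance is sensitive to word order.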
You can use: Class Soundex
This is called SoundEx; look up "java soundex" for several implementations.
One of them is the Apache Soundex, which looks good (although I haven't used it myself).
Sounds like SoundEx; an implementation is available in Apache Commons.
You can try the SoundEx algorithm.
String matching is very problem-specific, because most of the time you will have the same characteristics of noise in your strings to be matched, be it extra punctuation, typos or spelling errors. You will need to find an algorithm that is appropriate for the problems in your input data if you are doing this on a wide scale.
Soundex will give you a degree of confidence that two strings sound the same, but you may have to do some upfront cleaning first (like removing punctuation and tokenizing the string into separate words).
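As a rough sketch of that cleanup-then-Soundex idea, the following plain Java strips punctuation, splits the name into words, encodes each word with a hand-written American Soundex, and sorts the codes so that word order no longer matters. The class SoundexNameKey and the method nameKey are hypothetical names, and this is not the Apache Commons implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SoundexNameKey {

    // Soundex digit for each letter A..Z; '0' marks letters that carry no code.
    private static final String CODES = "01230120022455012623010202";

    // American Soundex: first letter plus up to three digits, padded with zeros.
    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                lastCode = code;
                continue;
            }
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            if (code != '0' && code != lastCode) {
                out.append(code);
            }
            lastCode = code; // vowels reset the run, so a repeated code is allowed after them
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    // Upfront cleaning: split on non-letters, encode each word, sort the codes.
    static String nameKey(String name) {
        List<String> codes = new ArrayList<>();
        for (String token : name.split("[^A-Za-z]+")) {
            if (!token.isEmpty()) {
                codes.add(encode(token));
            }
        }
        Collections.sort(codes);
        return String.join(" ", codes);
    }
}
```

Two names whose keys are equal probably sound alike regardless of word order or punctuation. Note that this gives a match or no match rather than a percentage, so it would usually be combined with one of the distance-based metrics.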
The best thing you can do is to run a test. There is an enormous number of different algorithms you can use: Levenshtein is a great one, as is Soundex (although your mileage will vary with your problem area). There are also variations on those two algorithms, BTW.
I suggest having a look at the SimMetrics and SecondString libraries, which have loads of string matching implementations (of the two, I prefer SecondString).
It sounds like you have an interesting problem to solve, good luck!
Try SimMetrics, an open source library that includes SoundEx and ChapmanMatchingSoundex. The latter would give a far better score for the examples given (e.g. "Will Shake" vs "Shake, Will"), because it applies a matching approach on top of SoundEx. Another metric you may want to try is the q-Grams metric in the same library: although it is not phonetic, it scores very well regardless (if not better) in name matching tasks where the names differ in form.
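To show why q-grams cope well with reordered names, here is a simplified sketch in plain Java that compares the sets of overlapping character q-grams with the Jaccard index. This is an assumption-laden illustration, not SimMetrics' own q-Grams metric (which may pad and weight differently); the class name QGramSimilarity is hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class QGramSimilarity {

    // All distinct substrings of length q, compared case-insensitively.
    static Set<String> qGrams(String s, int q) {
        Set<String> grams = new HashSet<>();
        String lower = s.toLowerCase(Locale.ROOT);
        for (int i = 0; i + q <= lower.length(); i++) {
            grams.add(lower.substring(i, i + q));
        }
        return grams;
    }

    // Jaccard index: shared q-grams divided by all distinct q-grams in either string.
    static double similarity(String a, String b, int q) {
        Set<String> ga = qGrams(a, q);
        Set<String> gb = qGrams(b, q);
        if (ga.isEmpty() && gb.isEmpty()) {
            // Both strings are shorter than q, so fall back to plain equality.
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        }
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        int union = ga.size() + gb.size() - shared.size();
        return (double) shared.size() / union;
    }
}
```

With q = 2, "William Shakespeare" and "Shakespeare, William" share most of their bigrams, since only the grams spanning the word boundary and the comma differ, so the score stays above 70% even though the word order is swapped.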