简体   繁体   中英

Check how much a String sounds like another one in Java

I'd like to know if there is any class in Java able to check, using its own criteria, how much a String is equal to another one. Example :

  • William Shakespeare / William Shakespeare : might be 100%
  • William Shakespe**a**re / William Shakespe**e**re : might have above 90%
  • William Shakespeare / Shakespeare, William : might have above 70% (just examples)

I see two main candidates:

  • The Soundex encoding, implemented by Apache Commons . However, note that it's mainly meant for single, relatively short words. It won't find a similarity in your third example. Additionally, it really only works for English words.
  • The Levenshtein distance (Again implemented at Apache Commons ). This is language agnostic, but similarity for switched parts as in your third example will be relatively low (more like 40%). Modifications like the Damerau–Levenshtein distance may yield better results.

You have to use a "soft" string metric:

There are many others, see String Metrics for an overview.

The best algorith highly depends on the problem field. For example, SoundEx degrades for Eastern European names and the Hamming distance does not help you much if you want to compare the similiarity of "real world" words.

Generally, there is the levenshtein algorithm, which just outputs how many insert/update/delete operations you would have to perform (characterwise) in order to transform one string into another. Apache's StringUtils class has an implementation.

您可以使用: Class Soundex

This is called SoundEx, lookup java soundex for several implementations.

one of them is apache soundex which looks good (although I haven't used it myself).

听起来像SoundExApache Commons中提供了一个实现。

您可以尝试SoundEx算法。

String matching is very problem-specific, because most of the time you will have the same characteristics of noise in your strings to be matched, be it extra punctuation, typos or spelling errors. You will need to find an algorithm that is appropriate for the problems in your input data if you are doing this on a wide scale.

Soundex will give you a degree of confidence that two strings sound the same, but you may have to do some upfront cleaning first (like removing punctuation and tokenizing the string into separate words).

The best thing you can do is to run a test, there are an enormous amount of different algorithms you can use, levenshtein being a great one, as is soundex (although your mileage will vary with your problem area). There are also variations on those two algorithms, BTW.

I suggest having a look at the simmetrics and second string libraries which have loads of string matching implementations (of the two I prefer the second string library).

It sounds like you have an interesting problem to solve, good luck!

try SimMetrics - open source library including SoundEx and ChapmanMatchingSoundex which would give a far better score for the examples given. ie Will Shake vs Shake, Will this approach uses a matching approach on-top of SoundEx. Another metric you may want to try which although not phonetic scores very well regardless (if not better in differing name matching tasks) is the q-Grams metric in the same library.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM