
Searching a list of tens or a few hundred short text strings, sorted by relevance

I have a list of people that I'd like to search through. I need to know 'how much' each item matches the string it is being tested against.

The list is rather small, currently 100+ names, and it probably won't reach 1000 anytime soon.
I therefore assumed it would be OK to keep the whole list in memory and do the searching with something Java offers out of the box, or with some tiny library that just implements one or two matching algorithms. (In other words, without bringing in any complicated or overkill solution that stores indexes or relies on a database.)

What would be your choice in such a case?

EDIT: Of what has been advised, Levenshtein seems closest to what I need. However, it is easily fooled when the search query is "John" and the names in the list are significantly longer.
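To make the problem from the EDIT concrete, here is a minimal sketch of plain Levenshtein distance in Java. The class name `LevenshteinDemo` and the `bestTokenDistance` helper are hypothetical names made up for this illustration. Comparing the query against each whitespace-separated part of a name is just one possible workaround, not something proposed in the thread.

```java
import java.util.Locale;

public class LevenshteinDemo {

    /** Classic dynamic-programming edit distance, keeping only two rows in memory. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Hypothetical workaround (an assumption, not from the thread): take the
     * smallest distance between the query and any single token of the name,
     * ignoring case.
     */
    public static int bestTokenDistance(String query, String name) {
        String q = query.toLowerCase(Locale.ROOT);
        int best = Integer.MAX_VALUE;
        for (String token : name.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            best = Math.min(best, distance(q, token));
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(distance("John", "Jane"));                // 3
        System.out.println(distance("John", "John Smith"));          // 6
        System.out.println(bestTokenDistance("John", "John Smith")); // 0
    }
}
```

With the raw distance, "Jane" (3) ranks ahead of "John Smith" (6) for the query "John", which is exactly the effect described in the EDIT. The per-token variant avoids that, at the cost of ignoring the rest of the name.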

If you are looking for a 'how much' match, you should use Soundex. Here is a Java implementation of this algorithm.
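For reference, a minimal sketch of American Soundex in Java follows. It is not the implementation linked above; the class name `SoundexDemo` is made up for this example, and non-letters are simply skipped as a simplifying assumption. Note that Soundex produces a four-character phonetic code, so comparing two codes tells you whether names sound alike rather than by how much.

```java
import java.util.Locale;

public class SoundexDemo {

    // Soundex digit for each letter A..Z; '0' marks letters that carry no code
    // (vowels plus H, W and Y).
    private static final String CODES = "01230120022455012623010202";

    public static String soundex(String s) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (char ch : s.toUpperCase(Locale.ROOT).toCharArray()) {
            if (ch < 'A' || ch > 'Z') {
                continue; // assumption: ignore anything that is not A..Z
            }
            char code = CODES.charAt(ch - 'A');
            if (out.length() == 0) {
                out.append(ch); // the first letter is kept as-is
                lastCode = code;
                continue;
            }
            if (ch == 'H' || ch == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            if (code != '0' && code != lastCode) {
                out.append(code);
                if (out.length() == 4) {
                    break;
                }
            }
            lastCode = code; // a vowel resets this, so a repeated code is emitted again
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("John"));   // J500
        System.out.println(soundex("Jon"));    // J500
        System.out.println(soundex("Robert")); // R163
    }
}
```

"John" and "Jon" share the code J500, but so does "Jane", so on its own the code is a coarse filter rather than a relevance score.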

You should look at various string comparison algorithms and see which one suits your data best. Options include Jaro-Winkler, Smith-Waterman, etc. Look up SimMetrics, an F/OSS library that offers a very comprehensive set of string comparison algorithms.

In my opinion, the Jaro-Winkler algorithm will suit your requirement best. Here is a short summary of the Jaro-Winkler distance algorithm, and one of the PDFs that compares different algorithms: Link to PDF
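As a rough idea of how Jaro-Winkler could be used to sort a small in-memory list, here is a self-contained sketch in plain Java. It does not use the SimMetrics API. The class and method names are made up for this example, and the prefix scaling factor of 0.1 with a maximum prefix length of 4 is the commonly used default, assumed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class JaroWinklerDemo {

    /** Jaro similarity in [0, 1]; 1 means identical. */
    public static double jaro(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) return 1.0;
        if (s1.isEmpty() || s2.isEmpty()) return 0.0;
        // Characters only count as matching if they are at most this far apart.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] m1 = new boolean[s1.length()];
        boolean[] m2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                    m1[i] = true;
                    m2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!m1[i]) continue;
            while (!m2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / s1.length() + m / s2.length() + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    /** Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 characters. */
    public static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int prefix = 0;
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        while (prefix < max && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j); // 0.1 is the assumed scaling factor
    }

    /** Returns a copy of names, best match first (case-insensitive comparison). */
    public static List<String> rank(String query, List<String> names) {
        String q = query.toLowerCase(Locale.ROOT);
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingDouble(
                (String n) -> jaroWinkler(q, n.toLowerCase(Locale.ROOT))).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Jane Doe", "Bob Jones", "John Smith");
        System.out.println(rank("John", names));
    }
}
```

Because the score is normalized and rewards a common prefix, a short query such as "John" still scores well against a longer name that starts with it. For a list of a few hundred names, scoring every entry per query like this should be cheap enough to do without any index.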
