Searching a list of tens or a few hundred short text strings, sorting by relevance
I have a list of people that I'd like to search through. I need to know how closely each item matches the string it is being tested against.
The list is rather small, currently 100+ names, and it probably won't reach 1000 anytime soon.
Therefore I assumed it would be OK to keep the whole list in memory and do the searching using something Java offers out of the box, or using some tiny library that just implements one or two matching algorithms. (In other words, without bringing in any complicated or overkill solution that stores indexes or relies on a database.)
What would be your choice in such a case?
EDIT: It seems Levenshtein comes closest to what I need out of what has been advised. The only problem is that it gets easily fooled when the search query is "John" and the names in the list are significantly longer.
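One way to work around the short-query problem described in the edit is to compare the query against each word of a name separately and normalize the distance by length. The following is only a minimal sketch under that assumption, using plain Java with no library; the class name NameSearch and the methods levenshtein, score and search are made up for illustration and do not come from any answer in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class NameSearch {

    // Classic dynamic-programming Levenshtein distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed scoring rule: best normalized similarity (0.0 to 1.0) between
    // the query and any whitespace-separated token of the name.
    static double score(String query, String name) {
        String q = query.trim().toLowerCase(Locale.ROOT);
        double best = 0.0;
        for (String token : name.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            int maxLen = Math.max(q.length(), token.length());
            if (maxLen == 0) {
                continue;
            }
            double sim = 1.0 - (double) levenshtein(q, token) / maxLen;
            best = Math.max(best, sim);
        }
        return best;
    }

    // Returns a copy of the names, most relevant first.
    static List<String> search(String query, List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingDouble((String n) -> score(query, n)).reversed());
        return sorted;
    }
}
```

With this rule, the query "John" scores 1.0 against "John Alexander Smith" because one token matches exactly, regardless of how long the full name is. For a list of a few hundred names, recomputing every score per query should be cheap enough to do in memory.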
Check out Double Metaphone, a successor to the 1990 Metaphone algorithm, which was itself an improvement on Soundex.
http://commons.apache.org/codec/userguide.html
http://svn.apache.org/viewvc/commons/proper/codec/trunk/src/java/org/apache/commons/codec/language/DoubleMetaphone.java?view=markup
You should look at various string comparison algorithms and see which one suits your data best. Options include Jaro-Winkler, Smith-Waterman, etc. Look up SimMetrics, a F/OSS library that offers a very comprehensive set of string comparison algorithms.
In my opinion, the Jaro-Winkler algorithm will suit your requirement best. Here is a short summary of the Jaro-Winkler distance algorithm. One of the PDFs that compares different algorithms: Link to PDF
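To make the Jaro-Winkler suggestion more concrete, here is a rough plain-Java sketch of the similarity as it is commonly described: the Jaro score plus a bonus for a shared prefix of up to 4 characters, with the usual scaling factor of 0.1. The class and method names are invented for illustration, and this is not the SimMetrics implementation; a library version may differ in details such as case handling.

```java
final class JaroWinkler {

    // Jaro similarity in the range 0.0 to 1.0.
    static double jaro(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) {
            return 1.0;
        }
        if (s1.isEmpty() || s2.isEmpty()) {
            return 0.0;
        }
        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double t = outOfOrder / 2.0; // transpositions
        return (m / s1.length() + m / s2.length() + (m - t) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for a common prefix (max 4 chars).
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

Because of the prefix bonus, a short query such as "john" tends to score higher against names that start with the same letters, which may suit name search better than raw edit distance. The sketch is case-sensitive, so lower-casing both inputs before comparing is probably a good idea.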