Searching a list of tens or a few hundred short text strings, sorting by relevance

I have a list of people that I'd like to search through. I need to know how well each item matches the string it is being tested against.

The list is rather small, currently 100+ names, and it probably won't reach 1000 anytime soon. I therefore assumed it would be OK to keep the whole list in memory and do the searching with something Java offers out of the box, or with some tiny library that just implements one or two matching algorithms. (In other words, without bringing in any complicated or overkill solution that stores indexes or relies on a database.)

What would you choose in this case?

EDIT: Of the approaches suggested so far, Levenshtein seems closest to what I need. The only problem is that it is easily fooled when the search query is "John" and the names in the list are significantly longer.
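One possible way around the short-query problem described in the edit is to compare the query against each word of a name separately and normalize the distance by length, so that "John" scores highly against "John Fitzgerald Kennedy". The following is only a minimal sketch under those assumptions; the class `NameSearch` and its methods `levenshtein`, `similarity`, `score`, and `rank` are hypothetical names chosen for illustration, not part of any library mentioned in the thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper class; all names here are illustrative assumptions.
class NameSearch {

    // Classic dynamic-programming Levenshtein distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance normalized to a similarity in [0, 1]; 1.0 means identical.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        if (max == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / max;
    }

    // Best similarity between the query and any whitespace-separated
    // word of the name, compared case-insensitively.
    static double score(String query, String name) {
        String q = query.trim().toLowerCase();
        double best = 0.0;
        for (String token : name.trim().toLowerCase().split("\\s+")) {
            best = Math.max(best, similarity(q, token));
        }
        return best;
    }

    // Returns a copy of the names sorted by descending score.
    static List<String> rank(String query, List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingDouble((String n) -> score(query, n)).reversed());
        return sorted;
    }
}
```

For a list of a few hundred names, scoring every entry on each query this way should be cheap enough to do in memory. Multi-word queries would need extra handling, for example scoring each query word separately and averaging.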

If you are looking for a 'how much' match, you should use Soundex. A Java implementation of this algorithm was linked here.
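For readers without access to the linked implementation, here is a minimal sketch of American Soundex written from the commonly described rules; it is not the code the answer linked to, and the class name `SoundexSketch` is a hypothetical choice. Note that Soundex by itself produces a four-character phonetic code, so comparing two codes gives a yes/no answer rather than a graded score.

```java
// Hypothetical class name; a sketch of American Soundex, not the linked code.
class SoundexSketch {

    // Digit for each letter A..Z; '0' marks letters that are not coded
    // (vowels plus H, W, and Y).
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < name.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(name.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                // The first letter is kept as-is, but its code still
                // suppresses an immediately following duplicate.
                out.append(c);
                lastCode = code;
            } else if (c == 'H' || c == 'W') {
                // H and W do not separate letters that share a code.
                continue;
            } else {
                if (code != '0' && code != lastCode) {
                    out.append(code);
                }
                lastCode = code; // vowels reset the duplicate check
            }
        }
        // Pad with zeros to the fixed length of four characters.
        while (out.length() > 0 && out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }
}
```

With this sketch, "John" and "Jon" map to the same code, so they would count as a phonetic match, while names that merely share a prefix may not.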

You should look at various string comparison algorithms and see which one suits your data best. Options include Jaro-Winkler, Smith-Waterman, etc. Look up SimMetrics, a F/OSS library that offers a very comprehensive set of string comparison algorithms.

In my opinion, the Jaro-Winkler algorithm will suit your requirement best. A short summary of the Jaro-Winkler distance algorithm and a PDF that compares different algorithms were linked here.
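To give a rough idea of what Jaro-Winkler computes, here is a minimal sketch based on the standard textbook definition, assuming the usual prefix scaling factor of 0.1 and a maximum prefix length of 4. The class `JaroWinklerSketch` is a hypothetical name and this is not the API of SimMetrics or any other library.

```java
// Hypothetical class name; a sketch of the textbook Jaro-Winkler similarity.
class JaroWinklerSketch {

    // Jaro similarity in [0, 1]; 1.0 means identical strings.
    static double jaro(String s1, String s2) {
        int len1 = s1.length();
        int len2 = s2.length();
        if (len1 == 0 && len2 == 0) {
            return 1.0;
        }
        if (len1 == 0 || len2 == 0) {
            return 0.0;
        }
        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0;
    }

    // Jaro similarity boosted for a shared prefix of up to 4 characters.
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

The prefix boost is what tends to make this measure a reasonable fit for personal names, where the beginning of a name is usually typed correctly. The comparison here is case-sensitive, so inputs would typically be lowercased first.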
