What is the most efficient method to search a large collection of strings for the closest match?
I have a large file (400K lines of English sentences) and need to search it, comparing each sentence to an "input" string, which is also an English sentence. I'm not concerned about the memory footprint of this application; I'm looking for the fastest way to do this. Currently, I have the sentences stored as a large list of strings, and the program iterates through them all, one at a time, computing the Hamming distance between each string and the input. The one that "matches" is the one with the shortest distance. Is there something faster than this?
The best data structure to use here is a tree. Looking up a string in a tree, or in a search trie (it really is spelled "trie"), takes fewer steps than scanning a list: a trie lookup takes time proportional to the length of the key rather than to the number of stored strings. You could use Java's TreeSet, or write your own tree implementation. A search trie, or prefix tree, is a search tree in which every node represents a character. A small example: you can find an image of the tree at https://i.stack.imgur.com/pmVCl.png
In this case, if you want to find/match the word "app", you need only 3 iterations in the whole tree data structure, one per character. This is the most efficient way I know.
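To make the prefix tree described above concrete, here is a minimal sketch in Java. The class name `SimpleTrie` and its methods `insert`, `contains`, and `hasPrefix` are hypothetical names chosen for illustration; they are not part of the JDK. The sketch assumes one node per character, with the children kept in a `HashMap`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal trie; all names here are illustrative assumptions. */
class SimpleTrie {
    /** One node per character; children are keyed by the next character. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEndOfKey; // true if an inserted key ends at this node
    }

    private final Node root = new Node();

    /** Adds a key, creating a node for each character not yet present on the path. */
    void insert(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.isEndOfKey = true;
    }

    /** Returns true only if exactly this key was inserted. */
    boolean contains(String key) {
        Node node = walk(key);
        return node != null && node.isEndOfKey;
    }

    /** Returns true if at least one inserted key starts with the given prefix. */
    boolean hasPrefix(String prefix) {
        return walk(prefix) != null;
    }

    /** Follows one child link per character; returns null if the path breaks off. */
    private Node walk(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                return null;
            }
        }
        return node;
    }
}
```

After inserting "app", "apple", and "apt", `contains("app")` follows three child links and returns true, `contains("ap")` returns false because no key ends there, and `hasPrefix("ap")` returns true. Note that this sketch only answers exact and prefix lookups; finding the entry with the smallest distance to the input would need an additional search over the tree.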