
What is the most efficient method to search a large collection of strings for the closest match?

Original post 2019-07-23 08:50:49 8 1 java/ string/ search/ document

I have a large file (400K lines of English sentences) and need to search it, comparing each sentence to an "input" string, which is also an English sentence. I'm not concerned about the memory footprint of this application; I'm looking for the fastest way to do this. Currently, I store the sentences as a large list of strings, and the program iterates through them one at a time, computing the Hamming distance between each sentence and the input; the "match" is the sentence with the shortest distance. Is there something faster than this?
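For reference, the linear scan described above might look roughly like the sketch below. The question does not pin down its distance metric, so this assumes Levenshtein edit distance, which is defined for strings of different lengths; the class name `LinearScanMatcher`, its method names, and the sample sentences are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class LinearScanMatcher {

    // Levenshtein edit distance, computed with two rolling rows.
    // Assumption: the question's actual metric may differ.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Scans every sentence and keeps the one with the smallest distance.
    // Returns null for an empty list; ties go to the earliest sentence.
    static String closest(List<String> sentences, String input) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String s : sentences) {
            int d = editDistance(s, input);
            if (d < bestDistance) {
                bestDistance = d;
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the 400K-line file.
        List<String> sentences =
                Arrays.asList("help the world", "yellow word", "hello world");
        System.out.println(closest(sentences, "hello word"));
    }
}
```

Every query costs one distance computation per stored sentence, which is the cost the question is asking how to avoid.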

1 Answer

The best data structure to use here is a tree. Lookup in a tree, or in a search trie (it really is spelled "trie"), is faster than scanning a list: a balanced tree finds a string in a logarithmic number of comparisons, and a trie finds it in a number of steps equal to the string's length, whereas a list scan has to look at every element. You could use Java's TreeSet implementation, or write your own tree. A search trie, or prefix tree, is a search tree in which every node represents a character. A small example: you can find an image of the tree at https://i.stack.imgur.com/pmVCl.png

In this case, if you want to find/match the word "app", you need only 3 steps through the whole tree structure, one per character. This is the most efficient way I know.
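A minimal sketch of such a prefix tree in Java, to make the "one step per character" lookup concrete. The class name `Trie`, its method names, and the sample words are assumptions for illustration; they are not taken from the linked image. Note that this covers exact and prefix lookup only. Finding the closest match by distance would need additional logic on top, which is not described here.

```java
import java.util.HashMap;
import java.util.Map;

class Trie {

    // One node per character; endOfWord marks that a stored word ends here.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    // Adds a word, creating one child node per character as needed.
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.endOfWord = true;
    }

    // True only if exactly this word was inserted.
    boolean contains(String word) {
        Node node = walk(word);
        return node != null && node.endOfWord;
    }

    // True if any inserted word starts with the given prefix.
    boolean hasPrefix(String prefix) {
        return walk(prefix) != null;
    }

    // Follows one edge per character; returns null if the path breaks off.
    private Node walk(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    public static void main(String[] args) {
        Trie trie = new Trie();
        // Hypothetical sample words.
        for (String w : new String[] {"app", "apple", "apt", "bat"}) {
            trie.insert(w);
        }
        System.out.println(trie.contains("app"));   // exact word lookup
        System.out.println(trie.contains("ap"));    // prefix only, not a stored word
        System.out.println(trie.hasPrefix("ap"));   // prefix lookup
    }
}
```

Looking up "app" follows three edges from the root (a, p, p), regardless of how many other words are stored.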