Efficient index for strings to do full text indexing

Originally posted 2012-05-16 17:50:26. Tags: java, string, algorithm, data-structures, indexing

I am looking for a data structure to solve the following problem. Receive as input a large collection of rather short strings (say 50 million, each under 30 characters) and index them as you like. Then, answer queries where I give a new string and you provide strings from the initial set which are similar to the string provided (say, the 10 best such strings). The notion of "similarity" would ideally be something like edit distance or Jaro-Winkler distance, or an approximation thereof, but it should be resilient to minor changes in spelling and word order, and to the addition of junk words. (For instance, unlike a standard indexing task, requesting "foo bar" should yield "foo" if it is indeed the closest string in the collection.)

To give an example, suppose the string collection is {"Charles Dickens", "Mary Shelley", "Robert Stephenson"}. Querying "Dickens, Charles" should find "Charles Dickens". Querying "by Shelley" should return "Mary Shelley".

The trivial approach, where you compute the similarity of the query string to all strings in the collection one by one, is too slow for a large collection. What would be a good data structure to answer such queries more efficiently? Ideally, I would be looking for a good Java implementation of this.
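
For reference, the trivial approach described above might look like the following minimal sketch. It assumes plain Levenshtein edit distance as the similarity measure; the names `LinearScanBaseline`, `editDistance` and `closest` are illustrative and do not come from any library. Every query costs one distance computation per stored string, which is what makes it impractical at 50 million entries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LinearScanBaseline {

    /** Levenshtein edit distance via dynamic programming, O(|a| * |b|) time. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Scans the whole collection and returns the n strings closest to the query. */
    public static List<String> closest(List<String> collection, String query, int n) {
        List<String> sorted = new ArrayList<>(collection);
        sorted.sort(Comparator.comparingInt(s -> editDistance(s, query)));
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```

Note that plain edit distance is only one possible choice here; on its own it is not resilient to changes in word order.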

2 answers

Two suggestions come to mind:

1) Pick a distance function that satisfies the triangle inequality and use a cover tree (http://en.wikipedia.org/wiki/Cover_tree). This might provide some speedup, but probably not orders of magnitude.

2) Guess that the closest match will include at least one stretch of k contiguous characters that is an exact match between the two strings. Build a data structure that, e.g. with hash table lookups, can find all the strings in the collection that have at least k contiguous characters in common with some part of the query string, and then use your distance function to see which of the returned strings is the best match. This should be fast, but it will sometimes miss the right answer.
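
To illustrate why suggestion 1) needs the triangle inequality, here is a rough sketch. It is not a cover tree; it is a much simpler single-pivot pruning scheme built on the same property. It assumes Levenshtein edit distance (which is a metric), a non-empty collection, and an arbitrary pivot (the first string); the class name `PivotPrunedSearch` is hypothetical. Since d(query, s) >= |d(query, pivot) - d(pivot, s)|, any string whose lower bound already reaches the best distance found so far can be skipped without computing its distance.

```java
import java.util.List;

public class PivotPrunedSearch {
    private final List<String> strings;
    private final String pivot;
    private final int[] distToPivot; // precomputed d(pivot, strings[i])

    /** Assumes a non-empty list; the first element serves as the pivot. */
    public PivotPrunedSearch(List<String> strings) {
        this.strings = strings;
        this.pivot = strings.get(0);
        this.distToPivot = new int[strings.size()];
        for (int i = 0; i < strings.size(); i++) {
            distToPivot[i] = editDistance(pivot, strings.get(i));
        }
    }

    /** Returns a string at minimal edit distance from the query. */
    public String nearest(String query) {
        int dq = editDistance(query, pivot);
        String best = pivot;
        int bestDist = dq;
        for (int i = 0; i < strings.size(); i++) {
            // Lower bound on d(query, strings[i]) from the triangle inequality.
            int lowerBound = Math.abs(dq - distToPivot[i]);
            if (lowerBound >= bestDist) {
                continue; // cannot be strictly better than the current best
            }
            int d = editDistance(query, strings.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = strings.get(i);
            }
        }
        return best;
    }

    /** Levenshtein edit distance via dynamic programming. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A metric tree applies this kind of bound hierarchically, so whole subtrees can be skipped rather than single strings. How much is actually pruned depends on the data and the distance function.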
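
Suggestion 2) could be sketched roughly as follows. The class `KGramIndex` and its methods are hypothetical names, the match is case-sensitive, and the distance function is passed in by the caller rather than fixed. A hash map from each k-character substring to the strings containing it yields the candidates, and only those are scored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToIntBiFunction;

public class KGramIndex {
    private final int k;
    // Maps every k-character substring to the indexed strings that contain it.
    private final Map<String, Set<String>> stringsByGram = new HashMap<>();

    public KGramIndex(int k) {
        this.k = k;
    }

    /** Indexes s under each of its k-grams; strings shorter than k are not indexed. */
    public void add(String s) {
        for (int i = 0; i + k <= s.length(); i++) {
            stringsByGram.computeIfAbsent(s.substring(i, i + k), g -> new LinkedHashSet<>())
                         .add(s);
        }
    }

    /** Returns all indexed strings that share at least one k-gram with the query. */
    public Set<String> candidates(String query) {
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i + k <= query.length(); i++) {
            result.addAll(stringsByGram.getOrDefault(
                    query.substring(i, i + k), Collections.<String>emptySet()));
        }
        return result;
    }

    /** Scores only the candidates with the caller's distance and keeps the best n. */
    public List<String> search(String query, int n,
                               ToIntBiFunction<String, String> distance) {
        List<String> ranked = new ArrayList<>(candidates(query));
        ranked.sort(Comparator.comparingInt(s -> distance.applyAsInt(query, s)));
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }
}
```

With k = 4 and the example collection from the question, "Dickens, Charles" shares k-grams such as "Dick" only with "Charles Dickens", so a single string is scored. A query shorter than k, or one with a typo in every window of k characters, produces no candidates at all, which is the "will sometimes miss" caveat.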

As an alternative to your trivial approach, you can solve the problem in two steps:

1) Build an index of the words that occur in the strings, so that you can find the sentences that contain a given word. The number of distinct words should be much smaller than 50 million (if we are talking about natural language). And you may not care about "foop bar" -> "foo", because you only have whole words.
2) Split your query into words. For each word, find all sentences containing this word. For each such sentence, compute the similarity with the query string using your metric.
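
The two steps above can be sketched as a small inverted index. Several details are assumptions made for illustration and are not part of the answer: the class name `WordIndex`, lowercasing, splitting on anything that is not a letter or digit, and taking the metric as a caller-supplied function.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToIntBiFunction;

public class WordIndex {
    // Step 1: word -> sentences that contain this word.
    private final Map<String, Set<String>> sentencesByWord = new HashMap<>();

    /** Lowercases and splits on runs of non-letter, non-digit characters. */
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) { // split may yield a leading empty token
                out.add(w);
            }
        }
        return out;
    }

    public void add(String sentence) {
        for (String w : words(sentence)) {
            sentencesByWord.computeIfAbsent(w, x -> new LinkedHashSet<>()).add(sentence);
        }
    }

    /** Step 2a: all sentences that contain at least one word of the query. */
    public Set<String> candidates(String query) {
        Set<String> result = new LinkedHashSet<>();
        for (String w : words(query)) {
            result.addAll(sentencesByWord.getOrDefault(w, Collections.<String>emptySet()));
        }
        return result;
    }

    /** Step 2b: rank the candidates with the caller's metric, smallest distance first. */
    public List<String> rank(String query, int n,
                             ToIntBiFunction<String, String> distance) {
        List<String> ranked = new ArrayList<>(candidates(query));
        ranked.sort(Comparator.comparingInt(s -> distance.applyAsInt(query, s)));
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }
}
```

In this sketch "Dickens, Charles" and "by Shelley" each retrieve exactly one candidate from the example collection, and "foo bar" retrieves "foo". A misspelled query word such as "foop" matches no index entry, so word-level lookup alone does not cover spelling errors.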