
Nearest neighbor search in Levenshtein-distance-like metric

I have a set of words (a 'dictionary'), and given a new word I have to find the closest word in the dictionary. (I use 'word' loosely here: each one is actually a variable-length sequence of abstract 'letters'.)

I am using a generalization of the Levenshtein distance as a metric. I needed to generalize because I need a specific 'cost' for exchanging two given letters: for example, exchanging 'a' with 'b' should cost less than exchanging 'a' with 'c'. I guess I still have to convince myself that my generalization is still a metric.
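For concreteness, here is a minimal sketch of what such a generalized distance might look like. The names `weighted_edit_distance`, `sub_cost`, `indel_cost` and `looks_like_metric` are made up for illustration, and the uniform insert/delete cost is an assumption, not something stated in the question. The second function checks conditions on the letter costs (zero only for identical letters, symmetry, triangle inequality) that are usually cited as sufficient for the resulting distance to be a metric; treat it as a sanity check rather than a proof.

```python
from itertools import product

def weighted_edit_distance(a, b, sub_cost, indel_cost=1.0):
    """Levenshtein-style distance with a per-pair substitution cost."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,          # delete x
                curr[j - 1] + indel_cost,      # insert y
                prev[j - 1] + sub_cost(x, y),  # substitute x -> y (0 if equal)
            ))
        prev = curr
    return prev[-1]

def looks_like_metric(alphabet, sub_cost, indel_cost=1.0):
    """Sanity-check the letter costs over a finite alphabet."""
    for x, y in product(alphabet, repeat=2):
        c = sub_cost(x, y)
        # Zero cost exactly for identical letters, and symmetric costs.
        if (c == 0) != (x == y) or c != sub_cost(y, x):
            return False
    for x, y, z in product(alphabet, repeat=3):
        # Going through an intermediate letter must never be cheaper.
        if sub_cost(x, z) > sub_cost(x, y) + sub_cost(y, z):
            return False
    return indel_cost > 0

# Hypothetical cost table: 'a' <-> 'b' is cheap, every other swap costs 1.
CHEAP = {frozenset("ab"): 0.5}

def sub_cost(x, y):
    if x == y:
        return 0.0
    return CHEAP.get(frozenset((x, y)), 1.0)

print(weighted_edit_distance("a", "b", sub_cost))
print(weighted_edit_distance("a", "c", sub_cost))
print(looks_like_metric("abc", sub_cost))
```

If the triangle check fails for some triple of letters, the dynamic program above may no longer agree with the cheapest sequence of edits, which is a good hint that the cost table needs adjusting.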

Currently I am using a naive linear search, i.e. iterating over all words in the dictionary and keeping track of the smallest distance, and I am looking for a more efficient method.
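As a baseline for comparison, the linear scan described above might look roughly like this. `nearest_linear` is a hypothetical name, and `unit_levenshtein` is only a unit-cost stand-in for whatever generalized distance is actually in use; any function taking two words and returning a number can be passed in its place.

```python
def unit_levenshtein(a, b):
    """Plain unit-cost Levenshtein distance, used here only as a stand-in."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def nearest_linear(query, dictionary, distance):
    """Scan every dictionary word and keep the one with the smallest distance."""
    best_word, best_dist = None, float("inf")
    for word in dictionary:
        d = distance(query, word)
        if d < best_dist:  # strict '<' keeps the first word on ties
            best_word, best_dist = word, d
    return best_word, best_dist

print(nearest_linear("helo", ["world", "hello", "yellow"], unit_levenshtein))
```

This costs one distance computation per dictionary word, which is the number any faster method has to beat.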

I started reading about methods for nearest neighbor search, but the main conceptual difficulty for me is that my 'points' (words) are not embedded in a space I can visualize, and they are not vectors with a fixed dimensionality.

With that in mind, I would like to hear some advice regarding which algorithms to look into.

Let me restate your question and give you a possible answer. Without seeing your data set, I don't know whether it will work well for you.

You already have an algorithm that, given two words, gives a distance between them. It is based on the Levenshtein distance for a path between those words, with a few modifications to the costs. And you want to find the closest word to a given word without having to search the whole dictionary.

The simplest thing that I would try is to start with your word and search through all possible sequences of modifications until you find the closest word in your dictionary. You want a modified breadth-first search. Store (0, your_word) as the only entry in some sort of priority queue (http://en.wikipedia.org/wiki/Priority_queue; a heap is easy to implement), take the distance to a random dictionary word as your current best solution, and then, as long as the priority queue is not empty:

Take the lowest cost element out.
If it is more expensive than your best solution:
    stop, return your best.
For each possible one step modification of that word:
    if the new word is in the dictionary and is lower cost than your best:
        improve best estimate
    else:
        store (new_cost, new_word) in the priority queue

The search set grows exponentially as it spreads out from your original word. But if there is a nearby word in the dictionary, it should find it fairly quickly. If you go this route, you may wish to put an upper bound on the search space, after which you give up.
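The loop above, together with a give-up bound, could be sketched in Python along these lines. This is one possible reading rather than a definitive implementation: the names `one_step_edits`, `nearest_by_expansion` and `max_cost` are invented for the example, it assumes a finite alphabet and a uniform insert/delete cost, it starts with no best solution instead of a random dictionary word, and it adds a `seen` table so the same word is not expanded twice.

```python
import heapq

def one_step_edits(word, alphabet, sub_cost, indel_cost):
    """Yield (step_cost, new_word) for every single-letter edit of word."""
    for i in range(len(word) + 1):
        for ch in alphabet:  # insert ch at position i
            yield indel_cost, word[:i] + ch + word[i:]
    for i, old in enumerate(word):
        yield indel_cost, word[:i] + word[i + 1:]  # delete the letter at i
        for ch in alphabet:  # substitute old -> ch
            if ch != old:
                yield sub_cost(old, ch), word[:i] + ch + word[i + 1:]

def nearest_by_expansion(query, dictionary, alphabet, sub_cost,
                         indel_cost=1.0, max_cost=float("inf")):
    """Return (word, cost) for the closest dictionary word, or (None, inf)."""
    if query in dictionary:
        return query, 0.0
    best_word, best_cost = None, float("inf")
    seen = {query: 0.0}    # cheapest known cost to reach each word
    heap = [(0.0, query)]  # priority queue ordered by accumulated cost
    while heap:
        cost, word = heapq.heappop(heap)
        if cost >= best_cost:
            break          # nothing left in the queue can beat the best
        if cost > seen[word]:
            continue       # stale entry; a cheaper path was found later
        for step, new_word in one_step_edits(word, alphabet, sub_cost, indel_cost):
            new_cost = cost + step
            if (new_cost > max_cost or new_cost >= best_cost
                    or new_cost >= seen.get(new_word, float("inf"))):
                continue
            seen[new_word] = new_cost
            if new_word in dictionary:
                best_word, best_cost = new_word, new_cost
            else:
                heapq.heappush(heap, (new_cost, new_word))
    return best_word, best_cost

def unit_cost(x, y):
    return 0.0 if x == y else 1.0

words = {"hello", "world"}
print(nearest_by_expansion("helo", words, "dehlorw", unit_cost))
```

Passing `dictionary` as a `set` keeps the membership test cheap. Without a finite `max_cost` the loop never terminates when the dictionary is empty, because insertions keep producing longer words, so a bound is worth setting in practice.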

This may be far from an optimal solution, but it shouldn't be too hard to program and try.


 