Implementing a simple Trie for efficient Levenshtein Distance calculation - Java

UPDATE 3

Done. Below is the code that finally passed all of my tests. Again, this is modeled after Murilo Vasconcelos's modified version of Steve Hanov's algorithm. Thanks to all who helped!

/**
 * Computes the minimum Levenshtein Distance between the given word (represented as an array of Characters) and the
 * words stored in theTrie. This algorithm is modeled after Steve Hanov's blog article "Fast and Easy Levenshtein
 * distance using a Trie" and Murilo Vasconcelo's revised version in C++.
 * 
 * http://stevehanov.ca/blog/index.php?id=114
 * http://murilo.wordpress.com/2011/02/01/fast-and-easy-levenshtein-distance-using-a-trie-in-c/
 * 
 * @param ArrayList<Character> word - the characters of an input word as an array representation
 * @return int - the minimum Levenshtein Distance
 */
private int computeMinimumLevenshteinDistance(ArrayList<Character> word) {

    theTrie.minLevDist = Integer.MAX_VALUE;

    int iWordLength = word.size();
    int[] currentRow = new int[iWordLength + 1];

    for (int i = 0; i <= iWordLength; i++) {
        currentRow[i] = i;
    }

    for (int i = 0; i < iWordLength; i++) {
        traverseTrie(theTrie.root, word.get(i), word, currentRow);
    }
    return theTrie.minLevDist;
}

/**
 * Recursive helper function. Traverses theTrie in search of the minimum Levenshtein Distance.
 * 
 * @param TrieNode node - the current TrieNode
 * @param char letter - the current character of the current word we're working with
 * @param ArrayList<Character> word - an array representation of the current word
 * @param int[] previousRow - a row in the Levenshtein Distance matrix
 */
private void traverseTrie(TrieNode node, char letter, ArrayList<Character> word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int minimumElement = currentRow[0];
    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }

        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);

        if (currentRow[i] < minimumElement) {
            minimumElement = currentRow[i];
        }
    }

    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    if (minimumElement < theTrie.minLevDist) {

        for (Character c : node.children.keySet()) {
            traverseTrie(node.children.get(c), c, word, currentRow);
        }
    }
}

UPDATE 2

Finally, I've managed to get this to work for most of my test cases. My implementation is practically a direct translation from Murilo's C++ version of Steve Hanov's algorithm. So how should I refactor this algorithm and/or make optimizations? Below is the code...

public int search(String word) {

    theTrie.minLevDist = Integer.MAX_VALUE;

    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
    return theTrie.minLevDist;
}
private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.charAt(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }
        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
    }

    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    if (minElement(currentRow) < theTrie.minLevDist) {

        for (Character c : node.children.keySet()) {
            searchRec(node.children.get(c), c, word, currentRow);

        }
    }
}

Thank you to everyone who contributed to this question. I tried getting the Levenshtein Automata to work, but I couldn't make it happen.

So I'm looking for suggestions on refactoring and/or optimizations regarding the above code. Please let me know if there's any confusion. As always, I can provide the rest of the source code as needed.


UPDATE 1

So I've implemented a simple Trie data structure and I've been trying to follow Steve Hanov's Python tutorial to compute the Levenshtein Distance. Actually, I'm interested in computing the minimum Levenshtein Distance between a given word and the words in the Trie, thus I've been following Murilo Vasconcelos's version of Steve Hanov's algorithm. It's not working very well, but here's my Trie class:

public class Trie {

    public TrieNode root;
    public int minLevDist;

    public Trie() {
        this.root = new TrieNode(' ');
    }

    public void insert(String word) {

        int length = word.length();
        TrieNode current = this.root;

        if (length == 0) {
            current.isWord = true;
        }
        for (int index = 0; index < length; index++) {

            char letter = word.charAt(index);
            TrieNode child = current.getChild(letter);

            if (child != null) {
                current = child;
            } else {
                current.children.put(letter, new TrieNode(letter));
                current = current.getChild(letter);
            }
            if (index == length - 1) {
                current.isWord = true;
            }
        }
    }
}

... and the TrieNode class:

public class TrieNode {

    public final int ALPHABET = 26;

    public char letter;
    public boolean isWord;
    public Map<Character, TrieNode> children;

    public TrieNode(char letter) {
        this.isWord = false;
        this.letter = letter;
        children = new HashMap<Character, TrieNode>(ALPHABET);
    }

    public TrieNode getChild(char letter) {

        if (children != null) {
            if (children.containsKey(letter)) {
                return children.get(letter); 
            }
        }
        return null;
    }
}

Now, I've tried to implement the search as Murilo Vasconcelos has it, but something is off and I need some help debugging this. Please give suggestions on how to refactor this and/or point out where the bugs are. The very first thing I'd like to refactor is the "minCost" global variable, but that's the smallest of things. Anyway, here's the code...

public void search(String word) {

    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
}

private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int replace, insertCost, deleteCost;

    for (int i = 1; i < size; i++) {

        char c = word.charAt(i - 1);

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;
        replace = (c == letter) ? previousRow[i - 1] : (previousRow[i - 1] + 1);

        currentRow[i] = minimum(insertCost, deleteCost, replace);
    }

    if (currentRow[size - 1] < minCost && !node.isWord) {
        minCost = currentRow[size - 1];
    }
    Integer minElement = minElement(currentRow);
    if (minElement < minCost) {

        for (Map.Entry<Character, TrieNode> entry : node.children.entrySet()) {
            searchRec(node, entry.getKey(), word, currentRow);
        }
    }
}

I apologize for the lack of comments. So what am I doing wrong?

INITIAL POST

I've been reading an article, Fast and Easy Levenshtein distance using a Trie, in hopes of figuring out an efficient way to compute the Levenshtein Distance between two Strings. My main goal with this is, given a large set of words, to be able to find the minimal Levenshtein Distance between an input word and this set of words.

In my trivial implementation, for each input word I compute the Levenshtein Distance between it and every word in the set, and return the minimum. It works, but it is not efficient...
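For reference, the trivial approach described above can be sketched roughly as follows. This is only an illustration, not the original code: the class and method names (`NaiveMinDistance`, `levenshtein`, `minDistance`) are made up for the example, and it assumes the dictionary is a plain list of strings.

```java
import java.util.Arrays;
import java.util.List;

public class NaiveMinDistance {

    // Classic dynamic-programming Levenshtein distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int replaceCost = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertCost = current[j - 1] + 1;
                int deleteCost = previous[j] + 1;
                current[j] = Math.min(replaceCost, Math.min(insertCost, deleteCost));
            }
            // The row just filled becomes the "previous" row of the next iteration.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    // Brute force: one full distance computation per dictionary word.
    static int minDistance(String input, List<String> dictionary) {
        int best = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            best = Math.min(best, levenshtein(input, candidate));
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("arb", "area", "tree");
        System.out.println(minDistance("arc", dictionary));
    }
}
```

Every query costs one full matrix per dictionary word, even when many words share a prefix; that repeated work is what a trie-based search avoids.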

I've been looking for implementations of a Trie, in Java, and I've come across two seemingly good sources:

However, these implementations seem too complicated for what I'm trying to do. As I've been reading through them to understand how they work and how Trie data structures work in general, I've only become more confused.

So how would I implement a simple Trie data structure in Java? My intuition tells me that each TrieNode should store the String it represents and also references to letters of the alphabet, not necessarily all letters. Is my intuition correct?

Once that is implemented, the next task is to compute the Levenshtein Distance. I read through the Python code example in the article above, but I don't speak Python, and my Java implementation runs out of heap memory once I hit the recursive searching. So how would I compute the Levenshtein Distance using the Trie data structure? I have a trivial implementation, modeled after this source code, but it doesn't use a Trie... it is inefficient.

It would be really nice to see some code in addition to your comments and suggestions. After all, this is a learning process for me... I've never implemented a Trie... so I have plenty to learn from this experience.

Thanks.

P.S. I can provide any source code if need be. Also, I've already read through and tried using a BK-Tree as suggested in Nick Johnson's blog, but it's not as efficient as I think it can be... or maybe my implementation is wrong.

From what I can tell, you don't need to improve the efficiency of the Levenshtein Distance computation itself; you need to store your strings in a structure that saves you from running distance computations so many times, i.e. by pruning the search space.

Since Levenshtein distance is a metric, you can use any of the metric space indices which take advantage of the triangle inequality. You mentioned BK-Trees, but there are others, e.g. Vantage Point Trees, Fixed-Queries Trees, Bisector Trees, Spatial Approximation Trees. Here are their descriptions:

Burkhard-Keller Tree

Nodes are inserted into the tree as follows: For the root node pick an arbitrary element from the space; add unique edge-labeled children such that the value of each edge is the distance from the pivot to that element; apply recursively, selecting the child as the pivot when an edge already exists.

Fixed-Queries Tree

As with BKTs except: Elements are stored at leaves; Each leaf has multiple elements; For each level of the tree the same pivot is used.

Bisector Tree

Each node contains two pivot elements with their covering radius (maximum distance between the centre element and any of its subtree elements); Filter into two sets those elements which are closest to the first pivot and those closest to the second, and recursively build two subtrees from these sets.

Spatial Approximation Tree

Initially all elements are in a bag; Choose an arbitrary element to be the pivot; Build a collection of nearest neighbours within range of the pivot; Put each remaining element into the bag of the nearest element to it from the collection just built; Recursively form a subtree from each element of this collection.

Vantage Point Tree

Choose a pivot from the set arbitrarily; Calculate the median distance between this pivot and each element of the remaining set; Filter elements from the set into left and right recursive subtrees such that those with distances less than or equal to the median form the left and those greater form the right.
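To make the Burkhard-Keller Tree description more concrete, here is a minimal Java sketch of insertion and range search. It is one possible reading of the description rather than a reference implementation: the names `BkTree`, `Node`, `insert`, `search`, and `collect` are invented for the example, and the distance function is an ordinary Levenshtein implementation included only to keep the block self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BkTree {

    private static final class Node {
        final String word;
        // Edge label = distance between this node's word and the child's word.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) previous[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int replaceCost = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                current[j] = Math.min(replaceCost, Math.min(current[j - 1], previous[j]) + 1);
            }
            int[] swap = previous; previous = current; current = swap;
        }
        return previous[b.length()];
    }

    public void insert(String word) {
        if (root == null) { root = new Node(word); return; }
        Node pivot = root;
        while (true) {
            int distance = levenshtein(word, pivot.word);
            if (distance == 0) return;               // word is already stored
            Node child = pivot.children.get(distance);
            if (child == null) {                     // no edge with this label yet
                pivot.children.put(distance, new Node(word));
                return;
            }
            pivot = child;                           // edge exists: the child becomes the pivot
        }
    }

    // Returns every stored word within maxDistance of the query.
    public List<String> search(String query, int maxDistance) {
        List<String> matches = new ArrayList<>();
        if (root != null) collect(root, query, maxDistance, matches);
        return matches;
    }

    private void collect(Node node, String query, int maxDistance, List<String> matches) {
        int distance = levenshtein(query, node.word);
        if (distance <= maxDistance) matches.add(node.word);
        // Triangle inequality: only edges labeled within
        // [distance - maxDistance, distance + maxDistance] can lead to matches.
        for (Map.Entry<Integer, Node> edge : node.children.entrySet()) {
            int label = edge.getKey();
            if (label >= distance - maxDistance && label <= distance + maxDistance) {
                collect(edge.getValue(), query, maxDistance, matches);
            }
        }
    }
}
```

The pruning in `collect` is where the savings come from: subtrees whose edge label falls outside the allowed band are skipped without computing any distances inside them.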

I've implemented the algo described in the "Fast and Easy Levenshtein distance using a Trie" article in C++ and it is really fast. If you want (if you understand C++ better than Python), I can paste the code somewhere.

Edit: I posted it on my blog.

Here is an example of Levenshtein Automata in Java (EDIT: moved to github). These will probably also be helpful:

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/

EDIT: The above links seem to have moved to github:

https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/util/automaton
https://github.com/apache/lucene-solr/tree/master/lucene/core/src/test/org/apache/lucene/util/automaton

It looks like the experimental Lucene code is based off of the dk.brics.automaton package.

Usage appears to be something similar to below:

LevenshteinAutomata builder = new LevenshteinAutomata(s);
Automaton automata = builder.toAutomaton(n);
boolean result1 = BasicOperations.run(automata, "foo");
boolean result2 = BasicOperations.run(automata, "bar");

In many ways, Steve Hanov's algorithm (presented in the first article linked in the question, Fast and Easy Levenshtein distance using a Trie), the ports of the algorithm made by Murilo and you (OP), and quite possibly every pertinent algorithm involving a Trie or similar structure, function much like a Levenshtein Automaton (which has been mentioned several times here) does:

Given:
       dict is a dictionary represented as a DFA (ex. trie or dawg)
       dictState is a state in dict
       dictStartState is the start state in dict
       dictAcceptState is a dictState arrived at after following the transitions defined by a word in dict
       editDistance is an edit distance
       laWord is a word
       la is a Levenshtein Automaton defined for laWord and editDistance
       laState is a state in la
       laStartState is the start state in la
       laAcceptState is a laState arrived at after following the transitions defined by a word that is within editDistance of laWord
       charSequence is a sequence of chars
       traversalDataStack is a stack of (dictState, laState, charSequence) tuples
       resultSet is the set of words accepted by both dict and la (initially empty)

Define dictState as dictStartState
Define laState as laStartState
Push (dictState, laState, "") on to traversalDataStack
While traversalDataStack is not empty
    Define currentTraversalDataTuple as the product of a pop of traversalDataStack
    Define currentDictState as the dictState in currentTraversalDataTuple
    Define currentLAState as the laState in currentTraversalDataTuple
    Define currentCharSequence as the charSequence in currentTraversalDataTuple
    For each char in alphabet
        Check if currentDictState has outgoing transition labeled by char
        Check if currentLAState has outgoing transition labeled by char
        If both currentDictState and currentLAState have outgoing transitions labeled by char
            Define newDictState as the state arrived at after following the outgoing transition of currentDictState labeled by char
            Define newLAState as the state arrived at after following the outgoing transition of currentLAState labeled by char
            Define newCharSequence as concatenation of currentCharSequence and char
            Push (newDictState, newLAState, newCharSequence) on to traversalDataStack
            If newDictState is a dictAcceptState, and if newLAState is a laAcceptState
                Add newCharSequence to resultSet
            endIf
        endIf
    endFor
endWhile

Steve Hanov's algorithm and its aforementioned derivatives obviously use a Levenshtein distance computation matrix in place of a formal Levenshtein Automaton. Pretty fast, but a formal Levenshtein Automaton can have its parametric states (abstract states which describe the concrete states of the automaton) generated and used for traversal, bypassing any edit-distance-related runtime computation whatsoever. So, it should run even faster than the aforementioned algorithms.
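As a rough illustration of that correspondence, the stack-based traversal above can be written in Java with a distance-matrix row standing in for the Levenshtein Automaton state. This is a sketch under assumptions, not code from any of the linked projects: `TrieDistanceSearch`, `Node`, `Frame`, `withinDistance`, and `nextRow` are hypothetical names, the trie is a bare-bones one written for the example, and "no outgoing transition in the automaton" is modeled as "every entry of the new row exceeds the allowed distance".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TrieDistanceSearch {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    // One stack entry: (dictState, laState, charSequence) from the pseudocode.
    static final class Frame {
        final Node node;
        final int[] row;
        final String prefix;
        Frame(Node node, int[] row, String prefix) {
            this.node = node;
            this.row = row;
            this.prefix = prefix;
        }
    }

    final Node root = new Node();

    void insert(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.isWord = true;
    }

    // Collects every stored word whose Levenshtein distance to query is <= maxDistance.
    List<String> withinDistance(String query, int maxDistance) {
        List<String> results = new ArrayList<>();
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) firstRow[i] = i;
        if (root.isWord && query.length() <= maxDistance) results.add("");

        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(root, firstRow, ""));
        while (!stack.isEmpty()) {
            Frame frame = stack.pop();
            for (Map.Entry<Character, Node> edge : frame.node.children.entrySet()) {
                int[] row = nextRow(frame.row, query, edge.getKey());
                int smallest = row[0];
                for (int value : row) smallest = Math.min(smallest, value);
                if (smallest > maxDistance) continue;   // dead state: prune this subtree
                String prefix = frame.prefix + edge.getKey();
                if (edge.getValue().isWord && row[row.length - 1] <= maxDistance) {
                    results.add(prefix);                // both "automata" accept
                }
                stack.push(new Frame(edge.getValue(), row, prefix));
            }
        }
        return results;
    }

    // Computes the next matrix row after consuming one trie letter.
    static int[] nextRow(int[] previous, String query, char letter) {
        int[] row = new int[previous.length];
        row[0] = previous[0] + 1;
        for (int i = 1; i < row.length; i++) {
            int insertCost = row[i - 1] + 1;
            int deleteCost = previous[i] + 1;
            int replaceCost = previous[i - 1] + (query.charAt(i - 1) == letter ? 0 : 1);
            row[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
        }
        return row;
    }
}
```

The row is recomputed on every transition here, which is exactly the runtime cost a precomputed automaton would avoid.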

If you (or anybody else) are interested in a formal Levenshtein Automaton solution, have a look at LevenshteinAutomaton. It implements the aforementioned parametric-state-based algorithm, as well as a pure concrete-state-traversal-based algorithm (outlined above) and dynamic-programming-based algorithms (for both edit distance and neighbor determination). It's maintained by yours truly :).

I was looking at your latest update 3, and the algorithm does not seem to work well for me.

Let's say you have the test cases below:

    Trie dict = new Trie();
    dict.insert("arb");
    dict.insert("area");

    ArrayList<Character> word = new ArrayList<Character>();
    word.add('a');
    word.add('r');
    word.add('c');

In this case, the minimum edit distance between "arc" and the dict should be 1, which is the edit distance between "arc" and "arb", but your algorithm will return 2 instead.

I went through the code below:

        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }

At least in the first loop, the letter passed in is one of the characters of the word itself, when it should be the letter of a node in the trie. As a result, the DP matrix gets one extra row for that character of the word, is that right? Each DP matrix has its first row as a duplicate. I ran exactly the same code you posted in the solution.

My intuition tells me that each TrieNode should store the String it represents and also references to letters of the alphabet, not necessarily all letters. Is my intuition correct?

No, a trie doesn't represent a String, it represents a set of strings (and all their prefixes). A trie node maps an input character to another trie node. So it should hold something like an array of characters and a corresponding array of TrieNode references. (Maybe not that exact representation, depending on efficiency in your particular use of it.)
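As an illustration of a node that only maps characters to child nodes, here is a minimal Java sketch. It assumes words consist solely of the lowercase letters a to z, so a fixed 26-slot array indexed by `letter - 'a'` takes the place of the character array and reference array mentioned above; `ArrayTrie`, `contains`, `hasPrefix`, and `find` are illustrative names, not code from the question.

```java
public class ArrayTrie {

    private static final class Node {
        // children[i] is the node reached by the letter ('a' + i), or null if absent.
        final Node[] children = new Node[26];
        boolean isWord;
    }

    private final Node root = new Node();

    // Assumes word contains only 'a'..'z'; other characters would index out of bounds.
    public void insert(String word) {
        Node current = root;
        for (int i = 0; i < word.length(); i++) {
            int slot = word.charAt(i) - 'a';
            if (current.children[slot] == null) {
                current.children[slot] = new Node();
            }
            current = current.children[slot];
        }
        current.isWord = true;
    }

    // True only if the exact word was inserted.
    public boolean contains(String word) {
        Node node = find(word);
        return node != null && node.isWord;
    }

    // True if any inserted word starts with the given prefix.
    public boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Follows the characters of s from the root; returns null if the path breaks off.
    private Node find(String s) {
        Node current = root;
        for (int i = 0; i < s.length() && current != null; i++) {
            current = current.children[s.charAt(i) - 'a'];
        }
        return current;
    }
}
```

Note that no node stores a string: the word a node stands for is implied by the path from the root, and `isWord` marks which paths are complete words rather than mere prefixes.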

If I understand correctly, you want to loop over all branches of the trie. That's not that difficult using a recursive function. I'm using a trie as well in my k-nearest neighbor algorithm, using the same kind of function. I don't know Java, but here's some pseudocode:

function walk (testitem trie)
   make an empty array results
   function compare (testitem children distance)
     if testitem = None
        place the distance and children into results
     else compare(testitem from second position, 
                  the sub-children of the first child in children,
                  if the first item of testitem is equal to that 
                  of the node of the first child of children 
                  add one to the distance (! non-destructive)
                  else just the distance)
        when there are any children left
             compare (testitem, the children without the first item,
                      distance)
    compare(testitem, children of root-node in trie, distance set to 0)
    return the results

Hope it helps.

The function walk takes a testitem (for example an indexable string, or an array of characters) and a trie. A trie can be an object with two slots: one specifying the node of the trie, the other the children of that node. The children are tries as well. In Python it would be something like:

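class Trie(object):
    def __init__(self, node=None, children=None):
        self.node = node
        # Avoid a shared mutable default: each Trie gets its own list of children.
        self.children = children if children is not None else []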

Or in Lisp...

(defstruct trie (node nil) (children nil))

Now a trie looks something like this:

(trie #node None
      #children ((trie #node f
                       #children ((trie #node o
                                        #children ((trie #node o
                                                         #children None)))
                                  (trie #node u
                                        #children ((trie #node n
                                                         #children None)))))))

Now the internal function (which you can also write separately) takes the testitem, the children of the root node of the tree (whose node value is None or whatever), and an initial distance set to 0.

Then we just recursively traverse both branches of the tree, starting left and then right.

I'll just leave this here in case anyone is looking for yet another treatment of this problem:

http://code.google.com/p/oracleofwoodyallen/wiki/ApproximateStringMatching

Correct me if I am wrong, but I believe your update 3 has an extra loop which is unnecessary and makes the program much slower:

for (int i = 0; i < iWordLength; i++) {
    traverseTrie(theTrie.root, word.get(i), word, currentRow);
}

You ought to call traverseTrie only once because within traverseTrie you are already looping over the whole word. The code should be only as follows:

traverseTrie(theTrie.root, ' ', word, currentRow);
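For comparison, here is a self-contained sketch of a single-traversal version. It is a rewrite under assumptions rather than the question's exact code: it takes a `String` instead of an `ArrayList<Character>`, keeps the running minimum in a field of its own instead of `theTrie.minLevDist`, and starts the recursion from each child of the root rather than from the root with a placeholder letter, because running a placeholder letter through the row computation would add a row of its own to the matrix. The names `MinDistanceTrie`, `minDistance`, and `walk` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class MinDistanceTrie {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();
    private int best;   // smallest distance found so far in the current query

    public void insert(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.isWord = true;
    }

    // Returns the minimum Levenshtein distance between word and any stored word,
    // or Integer.MAX_VALUE if the trie is empty.
    public int minDistance(String word) {
        int[] firstRow = new int[word.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        // The root stands for the empty prefix, whose row is firstRow itself.
        best = root.isWord ? word.length() : Integer.MAX_VALUE;
        // One traversal only: each trie letter adds exactly one row.
        for (Map.Entry<Character, Node> child : root.children.entrySet()) {
            walk(child.getValue(), child.getKey(), word, firstRow);
        }
        return best;
    }

    private void walk(Node node, char letter, String word, int[] previousRow) {
        int size = previousRow.length;
        int[] currentRow = new int[size];
        currentRow[0] = previousRow[0] + 1;
        int smallest = currentRow[0];

        for (int i = 1; i < size; i++) {
            int insertCost = currentRow[i - 1] + 1;
            int deleteCost = previousRow[i] + 1;
            int replaceCost = previousRow[i - 1] + (word.charAt(i - 1) == letter ? 0 : 1);
            currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            smallest = Math.min(smallest, currentRow[i]);
        }

        if (node.isWord && currentRow[size - 1] < best) {
            best = currentRow[size - 1];
        }
        // Row minima never decrease with depth, so deeper nodes cannot beat best
        // once the smallest entry of this row has reached it.
        if (smallest < best) {
            for (Map.Entry<Character, Node> child : node.children.entrySet()) {
                walk(child.getValue(), child.getKey(), word, currentRow);
            }
        }
    }
}
```

With "arb" and "area" inserted, `minDistance("arc")` in this sketch evaluates to 1, the distance to "arb".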

Well, here's how I did it a long time ago. I stored the dictionary as a trie, which is simply a finite-state machine restricted to the form of a tree. You can enhance it by not making that restriction. For example, common suffixes can simply be a shared subtree. You could even have loops, to capture stuff like "nation", "national", "nationalize", "nationalization", ...

Keep the trie as absolutely simple as possible. Don't go stuffing strings in it.

Remember, you don't do this to find the distance between two given strings. You use it to find the strings in the dictionary that are closest to one given string. The time it takes depends on how much Levenshtein distance you can tolerate. For distance zero, it is simply O(n) where n is the word length. For arbitrary distance, it is O(N) where N is the number of words in the dictionary.
