
How can I implement the Viterbi algorithm in C# to split conjoined words?

In short, I want to convert the first answer to the question here from Python into C#. My current solution to splitting conjoined words is exponential, and I would like a linear solution. I am assuming no spacing and consistent casing in my input text.

Background

I wish to convert conjoined strings such as "wickedweather" into separate words, for example "wicked weather", using C#. I have created a working solution, a recursive function that runs in exponential time, which is simply not efficient enough for my purposes (processing more than 100 joined words). Here are the questions I have read so far, which I believe may be helpful, but I cannot translate their responses from Python to C#.

My Current Recursive Solution

This is for people who only want to split a few words (< 50) in C# and don't really care about efficiency.

My current solution works out all possible combinations of words, finds the most probable output and displays it. I currently define the most probable output as the one that uses the longest individual words; I would prefer to use a different method. Here is my current solution, using a recursive algorithm.

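    // solutions caches the best split found for each substring; it is not declared in
    // the snippet and is assumed here to be a static Dictionary<string, string>.
    public static string find_words(string instring)
    {
        if (words.Contains(instring)) //where words is my dictionary of words
        {
            return instring;
        }
        if (solutions.ContainsKey(instring))
        {
            return solutions[instring];
        }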

        string bestSolution = "";
        string solution = "";

        for (int i = 1; i < instring.Length; i++)
        {
            string partOne = find_words(instring.Substring(0, i));
            string partTwo = find_words(instring.Substring(i, instring.Length - i));
            if (partOne == "" || partTwo == "")
            {
                continue;
            }
            solution = partOne + " " + partTwo;
            //if my current solution is smaller than my best solution so far (smaller solution means I have used the space to separate words fewer times, meaning the words are larger)
            if (bestSolution == "" || solution.Length < bestSolution.Length) 
            {
                bestSolution = solution;
            }
        }
        solutions[instring] = bestSolution;
        return bestSolution;
    }

This algorithm relies on having no spacing or other symbols in the input text (not really a problem here; I'm not fussed about splitting up punctuation). Random additional letters within the string can cause an error, unless I store each letter of the alphabet as a "word" within my dictionary. This means that "wickedweatherdykjs" would return "wicked weather d y k j s" using the above algorithm, when I would prefer an output of "wicked weather dykjs".

My updated exponential solution:

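    // Needs: using System; using System.Collections.Generic; using System.IO; using System.Linq;
    static List<string> words = File.ReadLines("E:\\words.txt").ToList();
    static Dictionary<char, HashSet<string>> compiledWords = buildDictionary(words);
    // Cache of already-solved substrings (used by find_words below).
    static Dictionary<string, string> solutions = new Dictionary<string, string>();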

    private void btnAutoSpacing_Click(object sender, EventArgs e)
    {
        string text = txtText.Text;
        text = RemoveSpacingandNewLines(text); //get rid of anything that breaks the algorithm
        if (text.Length > 150)
        {
            //possibly split the text up into more manageable chunks?
            //considering using textSplit() for this.
        }
        else 
        {
            txtText.Text = find_words(text);
        }
    }

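    static IEnumerable<string> textSplit(string str, int chunkSize)
    {
        // Round the chunk count up so a final chunk shorter than chunkSize is not dropped.
        return Enumerable.Range(0, (str.Length + chunkSize - 1) / chunkSize)
            .Select(i => str.Substring(i * chunkSize, Math.Min(chunkSize, str.Length - i * chunkSize)));
    }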

    private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
    {
        var dictionary = new Dictionary<char, HashSet<string>>();

        foreach (var word in words)
        {
            var key = word[0];

            if (!dictionary.ContainsKey(key))
            {
                dictionary[key] = new HashSet<string>();
            }

            dictionary[key].Add(word);
        }

        return dictionary;
    }

    static public string find_words(string instring)
    {
        string bestSolution = "";
        string solution = "";

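        // TryGetValue avoids a KeyNotFoundException when no dictionary word starts with this letter.
        if (compiledWords.TryGetValue(instring[0], out var bucket) && bucket.Contains(instring))
        {
            return instring;
        }

        if (solutions.ContainsKey(instring))
        {
            return solutions[instring];
        }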

        for (int i = 1; i < instring.Length; i++)
        {
            string partOne = find_words(instring.Substring(0, i));
            string partTwo = find_words(instring.Substring(i, instring.Length - i));
            if (partOne == "" || partTwo == "")
            {
                continue;
            }
            solution = partOne + " " + partTwo;
            if (bestSolution == "" || solution.Length < bestSolution.Length)
            {
                bestSolution = solution;
            }
        }
        solutions[instring] = bestSolution;
        return bestSolution;
    }

How I would like to use the Viterbi Algorithm

I would like to create an algorithm that works out the most probable split of a conjoined string, where the probability is calculated from the position of each word in a text file that I provide to the algorithm. Let's say the file starts with the most common word in the English language, with the second most common on the next line, and so on down to the least common word in my dictionary. It looks roughly like this:

  • the
  • be
  • and
  • ...
  • attorney

Here is a link to a small example of such a text file I would like to use. Here is a much larger text file which I would like to use.

The logic behind this file positioning is as follows...

It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.
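As a rough illustration of that rule, the ranked word list can be turned into a cost table where each word's cost is -log of its estimated probability, i.e. log(n log N). The class and method names below (ZipfCosts, Build) are made up for this sketch, and it assumes the list has enough words that log N is comfortably above zero.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZipfCosts
{
    // Maps each word to -log(P), where P is approximated by Zipf's law as
    // 1 / (rank * log(N)). Ranks are 1-based: the first word in the list has rank 1.
    public static Dictionary<string, double> Build(IEnumerable<string> rankedWords)
    {
        List<string> words = rankedWords.ToList();
        double logN = Math.Log(words.Count);
        var costs = new Dictionary<string, double>();

        for (int i = 0; i < words.Count; i++)
        {
            // Keep the first (most frequent) entry if a word is listed twice.
            if (!costs.ContainsKey(words[i]))
            {
                costs[words[i]] = Math.Log((i + 1) * logN);
            }
        }
        return costs;
    }
}
```

With the file described above, this could be called as ZipfCosts.Build(File.ReadLines(path)), where path is wherever the word list lives. More common words end up with lower costs.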

Generic Human, in his excellent Python solution, explains this much better than I can. I would like to convert his solution from Python into C#, but after many hours of attempts I haven't been able to produce a working version. I also remain open to the idea that relative frequencies with the Viterbi algorithm may not be the best way to split words. Are there any other suggestions for creating a solution in C#?

Can't help you with the Viterbi algorithm, but I'll give my two cents concerning your current approach. From your code it's not exactly clear what the words variable is. This can be a real bottleneck if you don't choose a good data structure. As a gut feeling I'd initially go with a Dictionary<char, HashSet<string>> where the key is the first letter of each word:

private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
    var dictionary = new Dictionary<char, HashSet<string>>();

    foreach (var word in words)
    {
        var key = word[0];

        if (!dictionary.ContainsKey(key))
        {
            dictionary[key] = new HashSet<string>();
        }

        dictionary[key].Add(word);
    }

    return dictionary;
}

And I'd also consider serializing it to disk to avoid building it up every time.

Not sure how much improvement you can make like this (I don't have full information about your current implementation), but benchmark it and see if you get any improvement.

NOTE: I'm assuming all words are cased consistently.

Written text is highly contextual, and you may wish to use a Markov chain to model sentence structure in order to estimate joint probability. Unfortunately, sentence structure breaks the Viterbi assumption. There is still hope, though: the Viterbi algorithm is a case of branch-and-bound optimization, aka "pruned dynamic programming" (something I showed in my thesis), so even when the cost-splicing assumption isn't met, you can still develop cost bounds and prune your population of candidate solutions. But let's set Markov chains aside for now. Assuming that the probabilities are independent and each follows Zipf's law, what you need to know is that the Viterbi algorithm works by accumulating additive costs.

For independent events, joint probability is the product of the individual probabilities, making negative log-probability a good choice for the cost.
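Written out, with w_1, ..., w_k standing for the words of one candidate split (notation introduced here only for illustration), taking the negative logarithm turns the product into a sum:

```latex
\[
P(w_1, \dots, w_k) = \prod_{j=1}^{k} P(w_j)
\quad\Longrightarrow\quad
-\log P(w_1, \dots, w_k) = \sum_{j=1}^{k} \bigl(-\log P(w_j)\bigr)
\]
```

Since -log is decreasing, the split with the highest joint probability is the one with the lowest total cost.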

So your single-step cost would be -log(P), or log(1/P), which is log(index * log(N)), which is log(index) + log(log(N)), and the latter term is a constant.
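Putting the additive costs to work, a minimal C# sketch of the dynamic program might look like the following. It is not a port of the Python answer mentioned in the question. The names (WordSplitter, Split, unknownCost) are invented here, the cost table is assumed to hold -log(P) per word as described above, and the handling of unknown letters (a flat per-character penalty, with runs of unknown characters merged into one token) is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class WordSplitter
{
    // Splits text into the word sequence with the lowest total cost.
    // wordCost maps each known word to -log(P); unknownCost is a hypothetical
    // penalty charged for every character that matches no dictionary word.
    public static string Split(string text, IReadOnlyDictionary<string, double> wordCost,
                               double unknownCost = 25.0)
    {
        int maxLen = 1;
        foreach (string w in wordCost.Keys) maxLen = Math.Max(maxLen, w.Length);

        int n = text.Length;
        var best = new double[n + 1];  // best[i] = cheapest cost of splitting the first i characters
        var lastLen = new int[n + 1];  // length of the final word in that cheapest split

        // Forward pass: extend the cheapest split of a shorter prefix by one word.
        for (int i = 1; i <= n; i++)
        {
            best[i] = double.PositiveInfinity;
            for (int len = 1; len <= Math.Min(i, maxLen); len++)
            {
                string candidate = text.Substring(i - len, len);
                double c;
                if (!wordCost.TryGetValue(candidate, out c))
                {
                    if (len > 1) continue;  // unknown text is consumed one character at a time
                    c = unknownCost;
                }
                if (best[i - len] + c < best[i])
                {
                    best[i] = best[i - len] + c;
                    lastLen[i] = len;
                }
            }
        }

        // Backtrack from the end, merging runs of unknown characters into one token.
        var tokens = new List<string>();
        bool prevUnknown = false;
        for (int i = n; i > 0; i -= lastLen[i])
        {
            string token = text.Substring(i - lastLen[i], lastLen[i]);
            bool unknown = !wordCost.ContainsKey(token);
            if (unknown && prevUnknown) tokens[tokens.Count - 1] = token + tokens[tokens.Count - 1];
            else tokens.Add(token);
            prevUnknown = unknown;
        }
        tokens.Reverse();
        return string.Join(" ", tokens);
    }
}
```

Each position only looks back at most maxLen characters, so the work grows linearly with the length of the text for a fixed dictionary. With a toy cost table such as wicked = 9, weather = 8, we = 3, at = 4, her = 5, the input "wickedweatherdykjs" comes back as "wicked weather dykjs".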
