How to implement the Viterbi algorithm in C# to split conjoined words?

Question

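In short: I want to convert the first answer here from Python to C#. My current solution for splitting conjoined words is exponential, and I would like a linear one. I am assuming the input text has no spacing and is cased consistently.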

Background

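I want to use C# to convert conjoined strings such as "wickedweather" into separate words, for example "wicked weather". I have a working solution, a recursive function that runs in exponential time, which is not efficient enough for my purposes (processing at least 100 joined words). I have read several questions here that I think may help, but I cannot translate their answers from Python to C#.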

My current recursive solution

This is for people who only want to split a few words (< 50) in C# and don't really care about efficiency.

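My current solution works out all possible combinations of words, finds the most probable output and displays it. I currently define the most probable output as the one that uses the longest individual words; I would prefer to use a different method. Here is my current solution, using a recursive algorithm.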

// Memo of already-solved substrings (this declaration is not shown in the original snippet).
static Dictionary<string, string> solutions = new Dictionary<string, string>();

static public string find_words(string instring)
{
    if (words.Contains(instring)) // where words is my dictionary of words
    {
        return instring;
    }
    if (solutions.ContainsKey(instring))
    {
        return solutions[instring];
    }

    string bestSolution = "";
    string solution = "";

    // Try every split point and solve both halves recursively.
    for (int i = 1; i < instring.Length; i++)
    {
        string partOne = find_words(instring.Substring(0, i));
        string partTwo = find_words(instring.Substring(i, instring.Length - i));
        if (partOne == "" || partTwo == "")
        {
            continue;
        }
        solution = partOne + " " + partTwo;
        // If my current solution is shorter than my best solution so far, keep it
        // (a shorter solution means fewer spaces were inserted, so the words are larger).
        if (bestSolution == "" || solution.Length < bestSolution.Length)
        {
            bestSolution = solution;
        }
    }
    solutions[instring] = bestSolution;
    return bestSolution;
}

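This algorithm relies on the input text containing no spaces or other symbols (not really a problem here, I'm not fussed about splitting on punctuation). Random extra letters within the string can cause an error unless I store each letter of the alphabet as a "word" in my dictionary. This means that "wickedweatherdykjs" would return "wicked weather d y k j s" using the above algorithm, when I would want an output of "wicked weather dykjs".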

My updated exponential solution:

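    static List<string> words = File.ReadLines("E:\\words.txt").ToList(); 
    static Dictionary<char, HashSet<string>> compiledWords = buildDictionary(words);
    // Memo of already-solved substrings (this declaration is not shown in the original snippet).
    static Dictionary<string, string> solutions = new Dictionary<string, string>();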

    private void btnAutoSpacing_Click(object sender, EventArgs e)
    {
        string text = txtText.Text;
        text = RemoveSpacingandNewLines(text); //get rid of anything that breaks the algorithm
        if (text.Length > 150)
        {
            //possibly split the text up into more manageable chunks?
            //considering using textSplit() for this.
        }
        else 
        {
            txtText.Text = find_words(text);
        }
    }

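    static IEnumerable<string> textSplit(string str, int chunkSize)
    {
        // Round up so a trailing chunk shorter than chunkSize is not silently dropped.
        int chunkCount = (str.Length + chunkSize - 1) / chunkSize;
        return Enumerable.Range(0, chunkCount)
            .Select(i => str.Substring(i * chunkSize, Math.Min(chunkSize, str.Length - i * chunkSize)));
    }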

    private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
    {
        var dictionary = new Dictionary<char, HashSet<string>>();

        foreach (var word in words)
        {
            var key = word[0];

            if (!dictionary.ContainsKey(key))
            {
                dictionary[key] = new HashSet<string>();
            }

            dictionary[key].Add(word);
        }

        return dictionary;
    }

    static public string find_words(string instring)
    {
        string bestSolution = "";
        string solution = "";

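        // TryGetValue avoids a KeyNotFoundException when no dictionary word starts with this letter.
        HashSet<string> candidates;
        if (compiledWords.TryGetValue(instring[0], out candidates) && candidates.Contains(instring))
        {
            return instring;
        }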

        if (solutions.ContainsKey(instring.ToString()))
        {
            return solutions[instring];
        }

        for (int i = 1; i < instring.Length; i++)
        {
            string partOne = find_words(instring.Substring(0, i));
            string partTwo = find_words(instring.Substring(i, instring.Length - i));
            if (partOne == "" || partTwo == "")
            {
                continue;
            }
            solution = partOne + " " + partTwo;
            if (bestSolution == "" || solution.Length < bestSolution.Length)
            {
                bestSolution = solution;
            }
        }
        solutions[instring] = bestSolution;
        return bestSolution;
    }

How I would like to use the Viterbi algorithm

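I would like to create an algorithm that works out the most probable solution for a conjoined string, where the probability is calculated from the position of each word in a text file that I provide to the algorithm. Let's say the file starts with the most common word in the English language, has the second most common word on the next line, and so on down to the least common word in my dictionary. It would look roughly like this: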

the
be
and
...
attorney

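Here is a link to a small example of such a text file that I would like to use. Here is a much larger text file that I would like to use.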

The logic behind this file positioning is as follows...

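It is reasonable to assume the words follow Zipf's law, that is, the word with rank n in the word list has probability roughly 1/(n log N), where N is the number of words in the dictionary.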
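As a rough illustration of that assumption, here is a minimal C# sketch that turns a ranked word list into a cost table, where the cost of a word is -log of its estimated probability. The names `ZipfCosts` and `Build` are hypothetical, and the approximation only makes sense for a reasonably large list (for a tiny N the "probabilities" can exceed 1).

```csharp
using System;
using System.Collections.Generic;

public static class ZipfCosts
{
    // rankedWords: most frequent word first, e.g. the lines of the word file in order.
    public static Dictionary<string, double> Build(IReadOnlyList<string> rankedWords)
    {
        var costs = new Dictionary<string, double>();
        double logN = Math.Log(rankedWords.Count);

        for (int i = 0; i < rankedWords.Count; i++)
        {
            int rank = i + 1;
            // Assumed model: P(word) ~ 1 / (rank * log N), so cost = -log P = log(rank * log N).
            // If a word appears twice, keep the cost of its first (better) rank.
            if (!costs.ContainsKey(rankedWords[i]))
            {
                costs[rankedWords[i]] = Math.Log(rank * logN);
            }
        }
        return costs;
    }
}
```

With the file from the question this could be called as `ZipfCosts.Build(File.ReadLines(path).ToList())`; words near the top of the file then get the smallest costs.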

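Generic Human, in his excellent Python solution, explains this much better than I can. I would like to convert his solution from Python to C#, but after many hours of trying I have not been able to produce a working version. I am also open to the idea that relative frequencies with the Viterbi algorithm may not be the best way to split words; are there any other suggestions for building a solution in C#?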

Answer 1

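I can't help you with the Viterbi algorithm, but I'll give my two cents on your current approach. From your code, it's not entirely clear what words is. This could be a real bottleneck if you don't pick a good data structure. As a gut feeling, I would initially go with a Dictionary<char, HashSet<string>>, where the key is the first letter of each word: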

private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
    var dictionary = new Dictionary<char, HashSet<string>>();

    foreach (var word in words)
    {
        var key = word[0];

        if (!dictionary.ContainsKey(key))
        {
            dictionary[key] = new HashSet<string>();
        }

        dictionary[key].Add(word);
    }

    return dictionary;
}

I would also consider serializing it to disk, to avoid building it every time.

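I'm not sure how much improvement you can get this way (I don't have full information about your current implementation), but benchmark it and see whether it helps.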

Note: I'm assuming all words are cased consistently.

Answer 2

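Written text is highly contextual, and you may wish to use a Markov chain to model sentence structure in order to estimate joint probability. Unfortunately, sentence structure breaks the Viterbi assumption. There is still hope, though: the Viterbi algorithm is a case of branch-and-bound optimization, also known as "pruned dynamic programming" (something I showed in my thesis), so even when the cost-splicing assumption isn't met, you can still develop cost bounds and prune your population of candidate solutions. But let's set Markov chains aside for now... Assuming that the probabilities are independent and each follows Zipf's law, what you need to know is that the Viterbi algorithm works by accumulating additive costs.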

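For independent events, the joint probability is the product of the individual probabilities, which makes negative log-probability a good choice for the cost.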
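Written out for a candidate split into words w_1, ..., w_k (these symbols are just notation for this note, and independence is assumed as above):

```latex
-\log P(w_1, \dots, w_k)
  = -\log \prod_{i=1}^{k} P(w_i)
  = \sum_{i=1}^{k} \bigl(-\log P(w_i)\bigr)
```

Minimising the summed cost is therefore the same as maximising the joint probability.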

So your single-step cost would be -log(P), or log(1/P), which is log(index * log(N)), which is log(index) + log(log(N)), and the latter term is a constant.
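To make the "accumulating additive costs" idea concrete, here is a minimal C# sketch of a dynamic-programming split, assuming a table that maps each word to its -log(P) cost has already been built. It is not a line-by-line port of the Python answer mentioned in the question; `WordSplitter`, `Split` and `UnknownCharCost` are hypothetical names, and the penalty of 50 for an unknown single character is an arbitrary assumption that keeps the recurrence defined when the input contains letters that match no word.

```csharp
using System;
using System.Collections.Generic;

public static class WordSplitter
{
    // Assumed penalty for a single character that matches no dictionary word.
    private const double UnknownCharCost = 50.0;

    // costs: word -> -log(probability); maxWordLength: length of the longest word in costs.
    public static string Split(string text, Dictionary<string, double> costs, int maxWordLength)
    {
        int n = text.Length;
        int maxLen = Math.Max(1, maxWordLength);
        var best = new double[n + 1]; // best[i]: minimal total cost of splitting text[0..i)
        var lastLen = new int[n + 1]; // lastLen[i]: length of the last word in that best split

        for (int i = 1; i <= n; i++)
        {
            best[i] = double.PositiveInfinity;
            for (int len = 1; len <= Math.Min(i, maxLen); len++)
            {
                string candidate = text.Substring(i - len, len);
                double wordCost;
                if (!costs.TryGetValue(candidate, out wordCost))
                {
                    if (len > 1) continue;      // unknown multi-letter chunk: not a candidate
                    wordCost = UnknownCharCost; // unknown single letter: allowed, but penalised
                }
                double total = best[i - len] + wordCost;
                if (total < best[i])
                {
                    best[i] = total;
                    lastLen[i] = len;
                }
            }
        }

        // Walk backwards through lastLen to recover the chosen words.
        var words = new List<string>();
        for (int i = n; i > 0; i -= lastLen[i])
        {
            words.Add(text.Substring(i - lastLen[i], lastLen[i]));
        }
        words.Reverse();
        return string.Join(" ", words);
    }
}
```

Each position looks back at most maxWordLength characters, so the work grows roughly with text length times maxWordLength rather than exponentially. One known limitation of this sketch: a run of unknown letters comes out as single characters rather than as one token.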

Answer 1
2 2016-12-07 22:24:04

Answer 2
1 2016-12-07 22:41:34