How can I implement the Viterbi algorithm in C# to split conjoined words?

Question

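In short: I want to convert the first answer here from Python to C#. My current solution for splitting conjoined words is exponential, and I would like a linear one. I am assuming that the input text contains no spaces and is consistently cased.

Background

I want to use C# to convert conjoined strings such as "wickedweather" into separate words, e.g. "wicked weather". I have written a working solution, a recursive function that runs in exponential time, which is not efficient enough for my purposes (processing at least 100 joined words). Here are the questions I have read so far that I think may be helpful, but I have not managed to convert their answers from Python to C#.

My current recursive solution

This is for people who only want to split a few words (<50) in C# and do not really care about efficiency.

My current solution works out all possible combinations of words, finds the most probable output and displays it. I currently define the most probable output as the one that uses the longest individual words; I would prefer to use a different method. Here is my current solution, which uses a recursive algorithm.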

static public string find_words(string instring)
    {
        if (words.Contains(instring)) //where words is my dictionary of words
        {
            return instring;
        }
        if (solutions.ContainsKey(instring)) //solutions caches the best split already found for a substring
        {
            return solutions[instring];
        }

        string bestSolution = "";
        string solution = "";

        for (int i = 1; i < instring.Length; i++)
        {
            string partOne = find_words(instring.Substring(0, i));
            string partTwo = find_words(instring.Substring(i, instring.Length - i));
            if (partOne == "" || partTwo == "")
            {
                continue;
            }
            solution = partOne + " " + partTwo;
            //if my current solution is smaller than my best solution so far (smaller solution means I have used the space to separate words fewer times, meaning the words are larger)
            if (bestSolution == "" || solution.Length < bestSolution.Length) 
            {
                bestSolution = solution;
            }
        }
        solutions[instring] = bestSolution;
        return bestSolution;
    }

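This algorithm relies on the input text containing no spaces or other symbols (not really a problem here; I am not worried about splitting on punctuation). Random extra letters in the string can cause an error unless I store every letter of the alphabet as a "word" in my dictionary. That means "wickedweatherdykjs" would return "wicked weather d y k j s" with the algorithm above, when I would prefer the output "wicked weather dykjs".

My updated exponential solution: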

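    static List<string> words = File.ReadLines("E:\\words.txt").ToList();
    static Dictionary<char, HashSet<string>> compiledWords = buildDictionary(words);
    // Memoization cache used by find_words: maps a substring to the best split found for it.
    static Dictionary<string, string> solutions = new Dictionary<string, string>();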

    private void btnAutoSpacing_Click(object sender, EventArgs e)
    {
        string text = txtText.Text;
        text = RemoveSpacingandNewLines(text); //get rid of anything that breaks the algorithm
        if (text.Length > 150)
        {
            //possibly split the text up into more manageable chunks?
            //considering using textSplit() for this.
        }
        else 
        {
            txtText.Text = find_words(text);
        }
    }

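    static IEnumerable<string> textSplit(string str, int chunkSize)
    {
        // Round the chunk count up so a trailing chunk shorter than chunkSize is not dropped.
        return Enumerable.Range(0, (str.Length + chunkSize - 1) / chunkSize)
            .Select(i => str.Substring(i * chunkSize, Math.Min(chunkSize, str.Length - i * chunkSize)));
    }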

    private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
    {
        var dictionary = new Dictionary<char, HashSet<string>>();

        foreach (var word in words)
        {
            var key = word[0];

            if (!dictionary.ContainsKey(key))
            {
                dictionary[key] = new HashSet<string>();
            }

            dictionary[key].Add(word);
        }

        return dictionary;
    }

    static public string find_words(string instring)
    {
        string bestSolution = "";
        string solution = "";

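        // TryGetValue avoids a KeyNotFoundException when no dictionary word starts with this letter.
        HashSet<string> candidates;
        if (compiledWords.TryGetValue(instring[0], out candidates) && candidates.Contains(instring))
        {
            return instring;
        }

        if (solutions.ContainsKey(instring))
        {
            return solutions[instring];
        }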

        for (int i = 1; i < instring.Length; i++)
        {
            string partOne = find_words(instring.Substring(0, i));
            string partTwo = find_words(instring.Substring(i, instring.Length - i));
            if (partOne == "" || partTwo == "")
            {
                continue;
            }
            solution = partOne + " " + partTwo;
            if (bestSolution == "" || solution.Length < bestSolution.Length)
            {
                bestSolution = solution;
            }
        }
        solutions[instring] = bestSolution;
        return bestSolution;
    }

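How I would like to use the Viterbi algorithm

I would like to create an algorithm that works out the most probable solution for a conjoined string, where the probability is calculated from the position of each word in a text file that I provide to the algorithm. Say the file starts with the most common word in the English language, has the next most common word on the following line, and so on down to the least common word in my dictionary. It should look roughly like this:

the
be
and
...
attorney

Here is a link to a small example of such a text file that I would like to use. Here is a much larger text file that I would like to use.

The logic behind this file ordering is as follows...

It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.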
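As a rough illustration of that formula in C#, the rank of each word in the file can be turned into a cost of -log(probability). This is only a sketch: the `ZipfCosts` class and `Build` method are names made up here, not part of the linked Python answer, and it assumes the list is ordered from most to least common and has at least 3 entries (so that log(log(N)) is positive).

```csharp
using System;
using System.Collections.Generic;

public static class ZipfCosts
{
    // Hypothetical helper: maps each word to -log(P), where P is approximated
    // by Zipf's law as 1 / (rank * log(N)) and rank starts at 1.
    public static Dictionary<string, double> Build(IReadOnlyList<string> rankedWords)
    {
        int n = rankedWords.Count;
        // -log(1 / (rank * log N)) = log(rank) + log(log N); the second term is constant.
        double logLogN = Math.Log(Math.Log(n));
        var costs = new Dictionary<string, double>(n);
        for (int i = 0; i < n; i++)
        {
            // If a word is listed twice, keep its first (most frequent) position.
            if (!costs.ContainsKey(rankedWords[i]))
            {
                costs[rankedWords[i]] = Math.Log(i + 1) + logLogN;
            }
        }
        return costs;
    }
}
```

With a file like the one above, `ZipfCosts.Build(File.ReadLines(path).ToList())` would give "the" the lowest cost and each later word a slightly higher one.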

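Generic Human explains this much better than I can in his excellent Python solution. I would like to convert his solution from Python to C#, but after many hours of trying I have not been able to produce a working version. I am also open to the idea that relative frequencies with the Viterbi algorithm may not be the best way to split words; are there any other suggestions for building a solution in C#?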

Answer 1

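I can't help you with the Viterbi algorithm, but I'll give my two cents on your current approach. From your code, it is not entirely clear what words is. This can be a real bottleneck if you do not choose a good data structure. As a gut feeling, I would start with a Dictionary<char, HashSet<string>>, where the key is the first letter of each word: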

private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
    var dictionary = new Dictionary<char, HashSet<string>>();

    foreach (var word in words)
    {
        var key = word[0];

        if (!dictionary.ContainsKey(key))
        {
            dictionary[key] = new HashSet<string>();
        }

        dictionary[key].Add(word);
    }

    return dictionary;
}

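I would also consider serializing it to disk, to avoid building it every time.

I am not sure how much improvement this will give you (I do not have full information about your current implementation), but benchmark it and see whether it helps.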
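A minimal way to run such a benchmark is with System.Diagnostics.Stopwatch. The `Bench` class and `AverageMilliseconds` method below are hypothetical names used only for this sketch:

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Runs the action once as a warm-up, then `iterations` more times,
    // and returns the average elapsed time per timed run in milliseconds.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        action(); // warm-up run so JIT compilation is not included in the timing
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }
}
```

For example, `Bench.AverageMilliseconds(() => find_words(text), 10)` could be compared before and after changing the data structure. Since find_words caches its results in solutions, the cache would need to be cleared inside the action for the runs to be comparable.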

Note: I am assuming that all words are consistently cased.

Answer 2

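Written text is highly contextual, and you may want to use a Markov chain to model sentence structure in order to estimate joint probabilities. Unfortunately, sentence structure breaks the Viterbi assumption. There is still hope, though: the Viterbi algorithm is a case of branch-and-bound optimization, also known as "pruned dynamic programming" (as I showed in my thesis), so even when the cost-splicing assumption is not met you can still formulate cost bounds and prune the set of candidate solutions. But let's set Markov chains aside for now. Assuming that the probabilities are independent and that each follows Zipf's law, what you need to know is that the Viterbi algorithm works by accumulating additive costs.

For independent events, the joint probability is the product of the individual probabilities, which makes negative log-probability a good choice of cost.

So your single-step cost is -log(P), or log(1/P), which is log(index * log(N)), which is log(index) + log(log(N)), and the latter term is a constant.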
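To make the additive-cost idea concrete, here is a minimal dynamic-programming sketch in C#. It is not a port of any particular existing solution. It assumes wordCost maps each dictionary word to its cost log(index) + log(log(N)) as described above; the names `WordSplitter`, `Split` and `UnknownCharCost` are made up for this example, and the large penalty for characters that match no word (plus the merging of adjacent unknown characters into one token) is an extra assumption added so that a split is always returned.

```csharp
using System;
using System.Collections.Generic;

public static class WordSplitter
{
    // Hypothetical penalty for a character that is not covered by any known word.
    private const double UnknownCharCost = 1e6;

    public static string Split(string text, IDictionary<string, double> wordCost)
    {
        int maxLen = 1;
        foreach (string word in wordCost.Keys) maxLen = Math.Max(maxLen, word.Length);

        int n = text.Length;
        var best = new double[n + 1]; // best[i]: minimal total cost of the first i characters
        var back = new int[n + 1];    // back[i]: length of the last token in that best split
        for (int i = 1; i <= n; i++)
        {
            // Fallback: treat the last character as an unknown one-letter token.
            best[i] = best[i - 1] + UnknownCharCost;
            back[i] = 1;
            for (int len = 1; len <= Math.Min(i, maxLen); len++)
            {
                double cost;
                if (!wordCost.TryGetValue(text.Substring(i - len, len), out cost)) continue;
                if (best[i - len] + cost < best[i])
                {
                    best[i] = best[i - len] + cost;
                    back[i] = len;
                }
            }
        }

        // Walk backwards through the stored lengths to recover the tokens,
        // gluing adjacent unknown characters together into a single token.
        var tokens = new List<string>();
        bool previousUnknown = false;
        for (int i = n; i > 0; i -= back[i])
        {
            string token = text.Substring(i - back[i], back[i]);
            bool unknown = !wordCost.ContainsKey(token);
            if (unknown && previousUnknown)
                tokens[tokens.Count - 1] = token + tokens[tokens.Count - 1];
            else
                tokens.Add(token);
            previousUnknown = unknown;
        }
        tokens.Reverse();
        return string.Join(" ", tokens);
    }
}
```

For each position in the text the loop only looks back as far as the longest dictionary word, so the number of dictionary lookups grows linearly with the length of the input rather than exponentially.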


2 answers

Answer 1
2 2016-12-07 22:24:04

Answer 2
1 2016-12-07 22:41:34
