How do I count the frequency of every word in a text document?

Question

using System;
using System.Collections.Generic;

class CounterDict<TKey>
{
    public Dictionary<TKey, int> _dict = new Dictionary<TKey, int>();

    public void Add(TKey key)
    {
        if(_dict.ContainsKey(key))
            _dict[key]++;
        else
        {
            _dict.Add(key, 1);
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        string line =  "The woods decay the woods decay and fall.";

        CounterDict<string> freq = new CounterDict<string>();
        foreach (string item in line.Split())
        {
            freq.Add(item.Trim().ToLower());
        }

        foreach (string key in freq._dict.Keys)
        {
            Console.WriteLine("{0}:{1}",key,freq._dict[key]);
        }           
    }
}

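I want to count the occurrences of every word in a string.
I think the code above will be slow for this task because of this part of the Add function: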

    if(_dict.ContainsKey(key))
        _dict[key]++;
    else
    {
        _dict.Add(key, 1);
    }

Also, is it good practice to keep _dict public? (I don't think it is.)

How should I modify it, or change it completely, to get the job done?

Answer 1

How about this:

Dictionary<string, int> words = new Dictionary<string, int>();
string input = "The woods decay the woods decay and fall.";
foreach (Match word in Regex.Matches(input, @"\w+", RegexOptions.ECMAScript))
{
    if (!words.ContainsKey(word.Value))
    {
        words.Add(word.Value, 1);
    }
    else
    {
        words[word.Value]++;
    }
}

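The main point is to replace .Split with a regular expression, so you don't need to keep a large array of strings in memory and can process one item at a time.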
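To make the one-item-at-a-time idea explicit, here is a minimal sketch that walks the input with Regex.Match and NextMatch, so each word is produced only when the loop asks for it. The names WordStream, Words, and Count are hypothetical, and the lower-casing is carried over from the question rather than from this answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WordStream
{
    // Hypothetical helper: yields one lower-cased word per iteration,
    // so no array holding every word is ever built.
    public static IEnumerable<string> Words(string input)
    {
        for (Match m = Regex.Match(input, @"\w+"); m.Success; m = m.NextMatch())
        {
            yield return m.Value.ToLowerInvariant();
        }
    }

    // Counts words using the same ContainsKey pattern as the answer above.
    public static Dictionary<string, int> Count(string input)
    {
        var counts = new Dictionary<string, int>();
        foreach (string word in Words(input))
        {
            if (!counts.ContainsKey(word))
            {
                counts.Add(word, 1);
            }
            else
            {
                counts[word]++;
            }
        }
        return counts;
    }
}

class Program
{
    static void Main()
    {
        var counts = WordStream.Count("The woods decay the woods decay and fall.");
        foreach (KeyValuePair<string, int> pair in counts)
        {
            Console.WriteLine("{0}:{1}", pair.Key, pair.Value);
        }
    }
}
```

For the sample sentence this counts "the", "woods", and "decay" twice each and "and" and "fall" once each. Because \w+ matches only word characters, the trailing period is not part of "fall".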

Answer 2

From the MSDN documentation:

    // When a program often has to try keys that turn out not to
    // be in the dictionary, TryGetValue can be a more efficient 
    // way to retrieve values.
    string value = "";
    if (openWith.TryGetValue("tif", out value))
    {
        Console.WriteLine("For key = \"tif\", value = {0}.", value);
    }
    else
    {
        Console.WriteLine("Key = \"tif\" is not found.");
    }

I haven't tested this myself, but it may improve your efficiency.
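Applied to the counter from the question, the TryGetValue idea might look like the sketch below. It is a reworking under assumptions, not a measured improvement: GetCount and Items are hypothetical members added for illustration, and the dictionary is made private to address the question about the public field.

```csharp
using System;
using System.Collections.Generic;

class CounterDict<TKey>
{
    // Private so callers cannot change counts behind the class's back.
    private readonly Dictionary<TKey, int> _dict = new Dictionary<TKey, int>();

    public void Add(TKey key)
    {
        int current;
        // TryGetValue leaves current at 0 when the key is missing,
        // so new and existing keys share the same two dictionary operations.
        _dict.TryGetValue(key, out current);
        _dict[key] = current + 1;
    }

    // Hypothetical accessor: returns 0 for keys that were never added.
    public int GetCount(TKey key)
    {
        int current;
        return _dict.TryGetValue(key, out current) ? current : 0;
    }

    // Hypothetical read-only view for enumerating the results.
    public IEnumerable<KeyValuePair<TKey, int>> Items
    {
        get { return _dict; }
    }
}

class Program
{
    static void Main()
    {
        var freq = new CounterDict<string>();
        char[] separators = { ' ', '.', ',', ';', '!', '?' };
        string line = "The woods decay the woods decay and fall.";

        foreach (string item in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            freq.Add(item.ToLowerInvariant());
        }

        foreach (KeyValuePair<string, int> pair in freq.Items)
        {
            Console.WriteLine("{0}:{1}", pair.Key, pair.Value);
        }
    }
}
```

For a key that already exists, the original Add performs three dictionary operations (ContainsKey, then a read and a write for ++), while this version always performs two. Whether that difference is noticeable on real input would need measuring.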

Answer 3

Here are some ways to count the occurrences of a string.
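One possible way, not necessarily among those this answer had in mind, is to group the matched words with LINQ. The sketch below assumes case-insensitive counting as in the question; LinqWordCount and Count are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class LinqWordCount
{
    // Groups identical lower-cased words and stores the size of each group.
    public static Dictionary<string, int> Count(string input)
    {
        return Regex.Matches(input, @"\w+")
            .Cast<Match>()
            .GroupBy(m => m.Value.ToLowerInvariant())
            .ToDictionary(g => g.Key, g => g.Count());
    }
}

class Program
{
    static void Main()
    {
        var counts = LinqWordCount.Count("The woods decay the woods decay and fall.");
        foreach (KeyValuePair<string, int> pair in counts)
        {
            Console.WriteLine("{0}:{1}", pair.Key, pair.Value);
        }
    }
}
```

This is shorter than the explicit loop, but it is not obviously faster. It trades the hand-written dictionary updates for grouping done by the library.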

3 answers

Answer 1
4 accepted 2009-10-26 11:41:02

Answer 2
2 2009-10-26 11:44:07

Answer 3
1 2009-10-26 11:46:41