
How can I count occurrences of two words following each other in a string in C#?

I wrote a one-word version using a regex, like this:

public Dictionary<string, int> MakeOneWordDictionary(string content)
{
    Dictionary<string, int> words = new Dictionary<string, int>();
    // Regex checking word match
    var wordPattern = new Regex(@"\w+");
    // Refactor text and clear it from punctuation marks
    content = RemoveSigns(content);
    foreach (Match match in wordPattern.Matches(content))
    {
        int currentCount = 0;
        words.TryGetValue(match.Value, out currentCount);
        currentCount++;
        words[match.Value] = currentCount;
    }
    return words;
}


This piece of code returns the words and their frequencies in a dictionary. I now need a two-word version of it, one that counts the occurrences of two words following each other in a string.

Should I modify the regex? If so, how should I modify it?

I think this can be written in a more self-explanatory way without a regex.

string input = "a a b test a a";
string[] words = input.Split(' ');

var combinations = from index in Enumerable.Range(0, words.Length-1)
                   select new Tuple<string,string>(words[index], words[index+1]);

var groupedTuples = combinations.GroupBy(t => t);
var countedCombinations = groupedTuples.Select(g => new { Value = g.First(), Count = g.Count()});

The first two lines define the input and split it on spaces, i.e. separate it into single words. The third statement walks the array of words from the first to the (N-1)th element (where N is the number of words) and builds a tuple of the n-th and the (n+1)-th element. In the fourth statement these tuples are grouped by themselves (two tuples with the same elements are considered equal). In the last step, the elements of each group are counted, and each count is stored together with its tuple in an object of an anonymous type.

This logic can also be applied to your regex version.
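As a rough sketch of what that could look like, the method below keeps the `\w+` pattern from the question and pairs each match with the one that follows it. The names `WordPairCounter` and `MakeTwoWordDictionary` are made up for illustration, and the call to `RemoveSigns` is left out on the assumption that `\w+` already skips punctuation. The key is simply the two words joined by a space.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordPairCounter
{
    // Hypothetical helper: counts every pair of adjacent words in the text.
    public static Dictionary<string, int> MakeTwoWordDictionary(string content)
    {
        var pairs = new Dictionary<string, int>();
        var wordPattern = new Regex(@"\w+");
        MatchCollection matches = wordPattern.Matches(content);

        // Stop one match early so that matches[i + 1] always exists.
        for (int i = 0; i < matches.Count - 1; i++)
        {
            // \w+ never matches a space, so a space is a safe separator.
            string key = matches[i].Value + " " + matches[i + 1].Value;

            // TryGetValue leaves currentCount at 0 when the key is new.
            pairs.TryGetValue(key, out int currentCount);
            pairs[key] = currentCount + 1;
        }
        return pairs;
    }
}
```

For the sample input `"a a b test a a"`, `WordPairCounter.MakeTwoWordDictionary(...)` should contain the key `"a a"` with a count of 2. Note that this sketch also pairs words across sentence boundaries, which may or may not be what you want.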

Edit: To get a dictionary, as in your example, you can use the ToDictionary extension method:

var countedCombinations = groupedTuples.ToDictionary(g => g.First(), g => g.Count());

The first parameter is a selector function for the key, the second one a selector function for the value.
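Put together, the whole approach might look like the sketch below. The class and method names (`PairDictionaryExample`, `CountPairs`) are invented for this example. It uses `g.Key` instead of `g.First()`, which gives the same tuple here, and it adds a guard that the snippets above do not have: `Enumerable.Range` throws on a negative count, which could happen once empty entries are removed from the split.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PairDictionaryExample
{
    // Hypothetical wrapper around the LINQ steps described above.
    public static Dictionary<Tuple<string, string>, int> CountPairs(string input)
    {
        // Drop empty entries so repeated spaces do not produce empty "words".
        string[] words = input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Math.Max guards against a negative count when there are no words.
        return Enumerable.Range(0, Math.Max(0, words.Length - 1))
            .Select(i => Tuple.Create(words[i], words[i + 1]))
            .GroupBy(pair => pair)   // tuples compare by their elements
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

A count can then be looked up with a tuple key, for example `PairDictionaryExample.CountPairs("a a b test a a")[Tuple.Create("a", "a")]`, which should give 2.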
