How to split a string while ignoring commas in C#?

I have made a small project that takes .cs files, reads them, and returns the most frequent word in the file. However, right now it reports that the most common word is a comma. How can I make the string splitting ignore commas?

For example, I have this string:

, . ? aa, b cdef cfed, abef abef abef,

Right now it returns that the most common word is 'abef' and that it occurred 2 times (the program doesn't count the third abef, the one followed by a comma).

Another example:

, . ? aa, b cdef cfed, abef abef abef, , ,

This currently returns that the most common word is a comma ',' and that it occurred 3 times. I want my program to ignore commas and count words only.

using System;
using System.Collections.Generic;
using System.IO;
using System.Windows.Forms;

namespace WindowsFormsApp8
{
  public partial class Form1 : Form
  {
    public Form1()
    {
        InitializeComponent();
    }


    private async void button1_Click(object sender, EventArgs e)
    {
        using (OpenFileDialog ofd = new OpenFileDialog() { Filter = "Text Documents |*.cs;*.txt", ValidateNames = true, Multiselect = false }) //openfiledialog (all .cs; all.txt)
        {
            if (ofd.ShowDialog() == DialogResult.OK) //if in file dialog a file gets selected
            {
                using (StreamReader sr = new StreamReader(ofd.FileName)) //text reader
                {
                    richTextBox1.Text = await sr.ReadToEndAsync(); //reads the file and returns it into textbox
                }
            }
        }
    }

    private void button2_Click(object sender, EventArgs e)
    {          
        string[] userText = richTextBox1.Text.ToLower().Split( ' ' );
        var frequencies = new Dictionary<string, int>(); // variable frequencies, dictionary with key string, value int.
        string highestWord = null;  //declare string highestword with starting value null.
        int highestFreq = 0; //declare integer highestfreq with starting value zero.

        foreach (string word in userText) //search words in our array userText that we declared at the beginning.
        {
            int freq; //declare integer freq.
            frequencies.TryGetValue(word, out freq); //trygetvalue from dictionary key, out value.
            freq += 1; //count it.

            if (freq > highestFreq) 
            {
                highestFreq = freq;
                highestWord = word;
            }
            frequencies[word] = freq; //store the updated count for this word in the dictionary.
        }
        MessageBox.Show("the most occurring word is: " + highestWord + ", it occurred " + highestFreq + " times"); //display data to messagebox.
    }
  }
}

Split can take an array of chars to split on, so you can split on both space and comma. Then remove the empty entries with StringSplitOptions.RemoveEmptyEntries:

 string[] userText = richTextBox1.Text.ToLower().Split(new char[] { ' ', ','}, StringSplitOptions.RemoveEmptyEntries );
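Since the text comes from a file, line breaks and tabs would also separate words. A minimal sketch of that idea, assuming a hypothetical helper named Tokenize (not part of the original code) and a separator list that you would adjust to your own input:

```csharp
using System;

public static class SplitSketch
{
    // Hypothetical helper: splits text into lowercase tokens on spaces, commas,
    // tabs, and line breaks, and drops the empty strings that adjacent
    // separators would otherwise produce.
    public static string[] Tokenize(string text)
    {
        char[] separators = { ' ', ',', '\t', '\r', '\n' };
        return text.ToLower().Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Note that "." and "?" are not in the separator list here, so they still come back as tokens of their own. Add them to the array if they should be ignored as well.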

You can also use LINQ to calculate the frequency of each word, with code like this:

var g = userText.GroupBy(x => x)
                .Select(z => new 
                { word = z.Key, count = z.Count()})
                .ToList();
string mostUsed = g.OrderByDescending(x => x.count)
                   .Select(x => x.word)
                   .FirstOrDefault();
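Putting the split and the grouping together, the whole count could look roughly like the sketch below. The class and method names are hypothetical, and the method returns the count alongside the word so both can be shown in the message box:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FrequencySketch
{
    // Hypothetical helper: returns the most frequent token with its count,
    // or a pair of (null, 0) when the text contains no tokens at all.
    public static KeyValuePair<string, int> MostFrequent(string text)
    {
        var best = text.ToLower()
            .Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(word => word)
            .Select(group => new { Word = group.Key, Count = group.Count() })
            .OrderByDescending(entry => entry.Count)
            .FirstOrDefault();

        if (best == null)
        {
            return new KeyValuePair<string, int>(null, 0);
        }
        return new KeyValuePair<string, int>(best.Word, best.Count);
    }
}
```

With this in place, button2_Click would only need to call FrequencySketch.MostFrequent(richTextBox1.Text) and display the Key and Value of the result.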

You could replace the commas with an empty string, then run the output through your algorithm.

string original = ", . ? a a, b cdef cfed, abef abef abef,";
string noCommas = original.Replace(",", string.Empty);
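One caveat with removing commas outright: if two words are separated only by a comma, as in "aa,b", they merge into a single token "aab". A possible variation, sketched here with a hypothetical helper name, is to replace each comma with a space instead and then split on spaces:

```csharp
using System;

public static class ReplaceSketch
{
    // Hypothetical helper: turns every comma into a space before splitting,
    // so "aa,b" still yields two words instead of the single token "aab".
    public static string[] WordsWithoutCommas(string text)
    {
        string noCommas = text.Replace(',', ' ');
        return noCommas.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

The resulting array can be fed to the existing foreach loop unchanged.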

Reference: https://msdn.microsoft.com/en-us/library/fk49wtc1(v=vs.110).aspx

Another option is to use regular expressions, which makes the word extraction easier to extend. Note that Regex.Split treats its pattern as the separator, so splitting on \w+ would return the punctuation rather than the words. To get the words themselves, match \w+ with Regex.Matches (this needs using System.Linq and using System.Text.RegularExpressions):

  string input = ", . ? a a, b cdef cfed, abef abef abef, , ,";
  string[] result = Regex.Matches(input, @"\w+").Cast<Match>().Select(m => m.Value).ToArray();

If ? is a valid word, then the regex could be @"\w+|\?".
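As a rough sketch of how regex-based extraction could feed the frequency count, the helper below uses Regex.Matches, which returns the matched words themselves. The class and method names are hypothetical, and the \w+ pattern is an assumption about what counts as a word:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexSketch
{
    // Hypothetical helper: counts how often each run of word characters
    // (letters, digits, underscore) occurs in the text, ignoring case.
    public static Dictionary<string, int> CountWords(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (Match match in Regex.Matches(text.ToLower(), @"\w+"))
        {
            int current;
            counts.TryGetValue(match.Value, out current); // current stays 0 for unseen words
            counts[match.Value] = current + 1;
        }
        return counts;
    }
}
```

Because only \w+ runs are matched, commas, periods, and question marks never enter the dictionary, so no separate cleanup step is needed.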

So, my recommendation is to use regex, even if the split method is enough for now, since it is more powerful and can easily accommodate later changes.

As a bonus, it is nice to learn about regular expressions.
