
Word Count Algorithm in C#

I am looking for a good word count class or function. When I copy and paste something from the internet and compare the result of my custom word count algorithm with MS Word, it is always off by a little more than 10%. I think that is too much. Do you know of an accurate word count algorithm in C#?

As @astander suggests, you can do a String.Split as follows:

string[] a = s.Split(
    new char[] { ' ', ',', ';', '.', '!', '"', '(', ')', '?' },
    StringSplitOptions.RemoveEmptyEntries);

By passing in an array of chars, you can split on multiple word breaks. Removing empty entries keeps you from counting the empty strings that appear between adjacent separators as words.
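As a rough illustration, the split-based count can be wrapped in a small helper. This is only a sketch: the class name `SplitWordCounter` and the extra whitespace separators (tab, carriage return, line feed) are assumptions added here, not part of the original answer.

```csharp
using System;

public static class SplitWordCounter
{
    // Separators from the snippet above, plus tab and newline characters
    // (the whitespace additions are an assumption of this sketch).
    private static readonly char[] Separators =
    {
        ' ', '\t', '\r', '\n', ',', ';', '.', '!', '"', '(', ')', '?'
    };

    public static int Count(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }

        // RemoveEmptyEntries drops the empty strings produced when two
        // separators are adjacent, e.g. the ", " in "Hello, world".
        return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}
```

For example, `SplitWordCounter.Count("Hello, world! (Really?)")` returns 3. Any character missing from the separator list (a dash or a colon, say) stays glued to its neighbouring word, so the list likely needs tuning for real text.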

Use String.Split with predefined chars: punctuation, spaces (collapsing runs of multiple spaces), and any other chars that you determine to be "word splits".

What have you tried?

I did see that the previous user got nailed for links, but here are some examples of using regex or char matching. Hope it helps, and nobody gets hurt X-)

String.Split Method (Char[])

Word counter in C#

C# Word Count

Use a regular expression to find words (e.g. [\w]+) and just count the matches:

public static Regex regex = new Regex(
  "[\\w]+",
RegexOptions.Multiline
| RegexOptions.CultureInvariant
| RegexOptions.Compiled
);

regex.Matches(_someString).Count
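Written out as a complete method, the regex approach might look like the sketch below. It uses `Regex.Matches`, which returns a `MatchCollection` with a `Count` property. The class name `RegexWordCounter` is an assumption made for illustration.

```csharp
using System.Text.RegularExpressions;

public static class RegexWordCounter
{
    // \w matches letters, digits, and the underscore, so each run of
    // such characters is treated as one word.
    private static readonly Regex WordPattern =
        new Regex(@"\w+", RegexOptions.CultureInvariant | RegexOptions.Compiled);

    public static int Count(string text)
    {
        return string.IsNullOrEmpty(text) ? 0 : WordPattern.Matches(text).Count;
    }
}
```

`RegexWordCounter.Count("one, two; three")` returns 3. Note that an apostrophe is not a `\w` character, so `RegexWordCounter.Count("don't stop")` also returns 3 ("don", "t", "stop"). Differences like this are one plausible source of a mismatch with the count reported by a word processor.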

I've just had the same problem in ClipFlair, where I needed to calculate WPM (words per minute) for movie captions, so I came up with the following:

You can define this static extension method in a static class and then add a using directive for the namespace of that static class in any file that needs to use this extension method. The extension method is invoked using s.WordCount(), where s is a string (an identifier [variable/constant] or a literal).

public static int WordCount(this string s)
{
  int last = s.Length-1;

  int count = 0;
  for (int i = 0; i <= last; i++)
  {
    if ( char.IsLetterOrDigit(s[i]) &&
         ((i==last) || char.IsWhiteSpace(s[i+1]) || char.IsPunctuation(s[i+1])) )
      count++;
  }
  return count;
}
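To show how the method might be hosted and used for the WPM calculation mentioned above, here is a minimal sketch. The class name `StringExtensions` and the `WordsPerMinute` helper are hypothetical and are not taken from ClipFlair. The `WordCount` body repeats the logic above so that the example compiles on its own.

```csharp
using System;

public static class StringExtensions
{
    // Same end-of-word test as above: a letter or digit counts as the end of
    // a word when it is the last character or is followed by white space or
    // punctuation.
    public static int WordCount(this string s)
    {
        int last = s.Length - 1;
        int count = 0;
        for (int i = 0; i <= last; i++)
        {
            if (char.IsLetterOrDigit(s[i]) &&
                (i == last || char.IsWhiteSpace(s[i + 1]) || char.IsPunctuation(s[i + 1])))
            {
                count++;
            }
        }
        return count;
    }

    // Hypothetical helper: words in a caption divided by its display time in minutes.
    public static double WordsPerMinute(this string caption, TimeSpan duration)
    {
        if (duration <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(duration));
        }
        return caption.WordCount() / duration.TotalMinutes;
    }
}
```

With this in scope, `"Hello there, world".WordCount()` returns 3, and `"Hello there, world".WordsPerMinute(TimeSpan.FromSeconds(30))` returns 6. Because the apostrophe is classified as punctuation, `"don't".WordCount()` returns 2, which may or may not match what you want for captions.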

Here is the stripped-down version of a C# class I made for counting words, Asian words, characters, etc. Its results are almost the same as Microsoft Word's. I developed the original code for counting words in Microsoft Word documents.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    namespace BL {
    public class WordCount 
    {

    public int NonAsianWordCount { get; set; }
    public int AsianWordCount { get; set; }
    public int TextLineCount { get; set; }
    public int TotalWordCount { get; set; }
    public int CharacterCount { get; set; }
    public int CharacterCountWithSpaces { get; set; }


    //public string Text { get; set; }

    public WordCount(){}

    ~WordCount() {}


    public void GetCountWords(string s)
    {
        #region Regular Expression Collection
        string asianExpression = @"[\u3001-\uFFFF]";
        string englishExpression = @"[\S]+";
        string LineCountExpression = @"[\r]+";
        #endregion


        #region Asian Character
        MatchCollection asiancollection = Regex.Matches(s, asianExpression);

        AsianWordCount = asiancollection.Count; //Asian Character Count

        s = Regex.Replace(s, asianExpression, " ");

        #endregion 


        #region English Characters Count
        MatchCollection collection = Regex.Matches(s, englishExpression);
        NonAsianWordCount = collection.Count;
        #endregion

        #region Text Lines Count
        MatchCollection Lines = Regex.Matches(s, LineCountExpression);
        TextLineCount = Lines.Count;
        #endregion

        #region Total Character Count

        CharacterCount = AsianWordCount;
        CharacterCountWithSpaces = CharacterCount;

        foreach (Match word in collection)
        {
            CharacterCount += word.Value.Length ;
            CharacterCountWithSpaces += word.Value.Length + 1;
        }

        #endregion

        #region Total Word Count
        TotalWordCount = AsianWordCount + NonAsianWordCount;
        #endregion
    }
}
}

You also need to check for newlines, tabs, and non-breaking spaces. I find it best to copy the source text into a StringBuilder and replace all newlines, tabs, and sentence-ending characters with spaces. Then split the string based on spaces.
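The normalize-then-split idea described above could look roughly like this. It is a sketch under assumptions: the class name `NormalizingWordCounter` and the exact set of characters replaced (including which ones count as "sentence-ending") are choices made here for illustration.

```csharp
using System;
using System.Text;

public static class NormalizingWordCounter
{
    // Characters turned into plain spaces before splitting (assumed set):
    // carriage return, line feed, tab, non-breaking space, and sentence enders.
    private static readonly char[] BreakChars =
    {
        '\r', '\n', '\t', '\u00A0', '.', '!', '?'
    };

    public static int Count(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }

        var builder = new StringBuilder(text);
        foreach (char c in BreakChars)
        {
            builder.Replace(c, ' ');
        }

        // Split on spaces only; empty entries from consecutive spaces are dropped.
        return builder.ToString()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Length;
    }
}
```

For instance, `NormalizingWordCounter.Count("one\ttwo\nthree\u00A0four. five")` returns 5.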

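using System;

public static class WordCount
{
    public static int Count(string text)
    {
        // Null, empty, or white-space-only text contains no words.
        if (string.IsNullOrWhiteSpace(text)) { return 0; }

        int wordCount = 0;
        text = text.Trim(); // trim leading and trailing white space

        // Split on ' ' (or any other char that you consider a word splitter).
        // RemoveEmptyEntries keeps runs of consecutive spaces from being
        // counted as extra words.
        foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            wordCount++;
        }
        return wordCount;
    }
}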
