Word Count Algorithm in C#

I am looking for a good word count class or function. When I copy and paste something from the internet and compare my custom word count algorithm against MS Word, the result is always off by a little more than 10%. I think that is too much. Does anyone know of an accurate word count algorithm in C#?

As @astander suggests, you can do a String.Split as follows:

string[] a = s.Split(
    new char[] { ' ', ',', ';', '.', '!', '"', '(', ')', '?' },
    StringSplitOptions.RemoveEmptyEntries);

By passing in an array of chars, you can split on multiple word breaks. Removing empty entries keeps the empty strings that appear between consecutive delimiters (such as a comma followed by a space) from being counted as words.
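As a rough sketch of the split approach, the delimiter array can also include whitespace characters other than the plain space, which pasted web text often contains. The class name `SplitWordCounter` and the exact delimiter set are assumptions for illustration, not part of the original answer.

```csharp
using System;

public static class SplitWordCounter
{
    // Assumed delimiter set: whitespace (including the non-breaking space,
    // common in text copied from web pages) plus common punctuation.
    private static readonly char[] Delimiters =
    {
        ' ', '\t', '\r', '\n', '\u00A0',
        ',', ';', '.', '!', '"', '(', ')', '?', ':'
    };

    public static int Count(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0;

        // RemoveEmptyEntries drops the empty strings produced when two
        // delimiters are adjacent, e.g. ", " or a double space.
        return s.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}
```

Because the apostrophe and hyphen are not in the array, "don't" and "well-known" each stay a single word here; adding them to the array would change that.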

Use String.Split with a predefined set of chars: punctuation, spaces (removing empty entries so that runs of spaces are not counted), and any other chars that you determine to be "word splits".

What have you tried?

I did see that the previous user got nailed for links, but here are some examples of using regex or char matching. Hope it helps, and nobody gets hurt X-)

String.Split Method (Char[])

Word counter in C#

C# Word Count

Use a regular expression to find words (e.g. [\\w]+) and count the matches:

public static Regex regex = new Regex(
  "[\\w]+",
RegexOptions.Multiline
| RegexOptions.CultureInvariant
| RegexOptions.Compiled
);

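// Matches (plural) returns every match; Match returns only the first one and has no Count.
int wordCount = regex.Matches(_someString).Count;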
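One thing to watch with `\w+` is that the apostrophe and hyphen are not word characters, so "don't" is counted as two words. This may be one source of disagreement with a word processor that counts whitespace-separated tokens. The sketch below contrasts the plain pattern with an assumed variant that keeps inner apostrophes and hyphens; the class name `RegexWordCounter` and the second pattern are illustrative, not from the original answer.

```csharp
using System.Text.RegularExpressions;

public static class RegexWordCounter
{
    // Plain pattern from the answer: runs of word characters.
    private static readonly Regex Simple =
        new Regex(@"\w+", RegexOptions.CultureInvariant | RegexOptions.Compiled);

    // Assumed variant: also accepts ' or \u2019 or - when it sits between
    // two runs of word characters, so "don't" and "well-known" are one word.
    private static readonly Regex Joined =
        new Regex(@"\w+(?:['\u2019-]\w+)*",
                  RegexOptions.CultureInvariant | RegexOptions.Compiled);

    public static int CountSimple(string s)
    {
        return string.IsNullOrEmpty(s) ? 0 : Simple.Matches(s).Count;
    }

    public static int CountJoined(string s)
    {
        return string.IsNullOrEmpty(s) ? 0 : Joined.Matches(s).Count;
    }
}
```

For the input "don't stop", `CountSimple` returns 3 and `CountJoined` returns 2. Which of the two is closer to the target count depends on the text being measured.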

I've just had the same problem in ClipFlair, where I needed to calculate WPM (Words-per-minute) for Movie Captions, so I came up with the following one:

You can define this extension method in a static class and then add a using directive for that class's namespace in any file that needs it. The extension method is invoked as s.WordCount(), where s is a string (a variable, a constant, or a literal).

public static int WordCount(this string s)
{
  int last = s.Length-1;

  int count = 0;
  for (int i = 0; i <= last; i++)
  {
    if ( char.IsLetterOrDigit(s[i]) &&
         ((i==last) || char.IsWhiteSpace(s[i+1]) || char.IsPunctuation(s[i+1])) )
      count++;
  }
  return count;
}
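To show how the pieces fit together, here is a minimal sketch that places the same end-of-word test in a static class and adds a words-per-minute helper for a caption. The names `StringWordExtensions` and `WordsPerMinute`, the null guard, and the `TimeSpan` parameter are assumptions for illustration; they do not come from ClipFlair.

```csharp
using System;

public static class StringWordExtensions
{
    // Same logic as above: a word ends at a letter or digit that is the last
    // character or is followed by whitespace or punctuation.
    public static int WordCount(this string s)
    {
        if (string.IsNullOrEmpty(s)) return 0; // added guard (assumption)

        int last = s.Length - 1;
        int count = 0;
        for (int i = 0; i <= last; i++)
        {
            bool endsWord = i == last
                || char.IsWhiteSpace(s[i + 1])
                || char.IsPunctuation(s[i + 1]);
            if (char.IsLetterOrDigit(s[i]) && endsWord) count++;
        }
        return count;
    }

    // Hypothetical helper: words per minute for a caption shown for `duration`.
    public static double WordsPerMinute(this string caption, TimeSpan duration)
    {
        if (duration <= TimeSpan.Zero) return 0;
        return caption.WordCount() / duration.TotalMinutes;
    }
}
```

A five-word caption displayed for 30 seconds gives 10 words per minute. Note that the apostrophe is classified as punctuation, so this test counts "don't" as two words.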

Here is a stripped-down version of a C# class I wrote for counting words, Asian characters, total characters, and so on. Its results are almost the same as Microsoft Word's. I developed the original code for counting words in Microsoft Word documents.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    namespace BL {
    public class WordCount 
    {

    public int NonAsianWordCount { get; set; }
    public int AsianWordCount { get; set; }
    public int TextLineCount { get; set; }
    public int TotalWordCount { get; set; }
    public int CharacterCount { get; set; }
    public int CharacterCountWithSpaces { get; set; }


    //public string Text { get; set; }

    public WordCount() { }


    public void GetCountWords(string s)
    {
        #region Regular Expression Collection
        string asianExpression = @"[\u3001-\uFFFF]";
        string englishExpression = @"[\S]+";
        string LineCountExpression = @"[\r]+";
        #endregion


        #region Asian Character
        MatchCollection asiancollection = Regex.Matches(s, asianExpression);

        AsianWordCount = asiancollection.Count; //Asian Character Count

        s = Regex.Replace(s, asianExpression, " ");

        #endregion 


        #region Non-Asian Word Count
        MatchCollection collection = Regex.Matches(s, englishExpression);
        NonAsianWordCount = collection.Count;
        #endregion

        #region Text Lines Count
        MatchCollection Lines = Regex.Matches(s, LineCountExpression);
        TextLineCount = Lines.Count;
        #endregion

        #region Total Character Count

        CharacterCount = AsianWordCount;
        CharacterCountWithSpaces = CharacterCount;

        foreach (Match word in collection)
        {
            CharacterCount += word.Value.Length ;
            CharacterCountWithSpaces += word.Value.Length + 1;
        }

        #endregion

        #region Total Word Count
        TotalWordCount = AsianWordCount + NonAsianWordCount;
        #endregion
    }
}
}

You also need to check for newlines, tabs, and non-breaking spaces. I find it best to copy the source text into a StringBuilder and replace all newlines, tabs, and sentence-ending characters with spaces. Then split the string on spaces.
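The steps just described could be sketched roughly as follows. The class name `NormalizingWordCounter` and the particular set of replaced characters are assumptions; extend the list with whatever separators your source text contains.

```csharp
using System;
using System.Text;

public static class NormalizingWordCounter
{
    // Assumed separators: newlines, tab, non-breaking space, and
    // sentence-ending punctuation.
    private static readonly char[] Separators =
        { '\r', '\n', '\t', '\u00A0', '.', '!', '?' };

    public static int Count(string text)
    {
        if (string.IsNullOrEmpty(text)) return 0;

        // Copy into a StringBuilder and turn every separator into a space.
        var sb = new StringBuilder(text);
        foreach (char c in Separators)
        {
            sb.Replace(c, ' ');
        }

        // Split on spaces; RemoveEmptyEntries ignores runs of spaces.
        return sb.ToString()
                 .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                 .Length;
    }
}
```

Normalizing first keeps the final split trivial, at the cost of one extra copy of the text.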

using System;

public static class WordCount
{
    public static int Count(string text)
    {
        // Null, empty, or whitespace-only text contains no words.
        if (string.IsNullOrWhiteSpace(text)) { return 0; }

        // Split on spaces; add any other chars that you consider word splitters
        // to the array. RemoveEmptyEntries keeps consecutive spaces from being
        // counted as extra (empty) words.
        string[] words = text.Split(
            new[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries);

        return words.Length;
    }
}
