
C# read file content code optimization

I have a large string which is read from a text file (e.g. a 1 MB text file), and I want to process the string. It takes nearly 10 minutes to process it.

Basically, the string is read character by character and the counter for each character is incremented by one. Some characters (space, comma, colon and semicolon) are counted as a space, so the space counter is incremented for them; all other characters are simply ignored.

Code:

string fileContent = "....." // a large string 
int min = 0;
int max = fileContent.Length;
Dictionary<char, int> occurrence  // example c=>0, m=>4, r=>8 etc....

//  Note: occurrence has only the letters a-z and a space. Comma, colon and semicolon are counted as a space and all other characters are ignored.

for (int i = min; i <= max; i++) // run loop to end 
{
    try // increment counter for alphabets and space
    {
        occurrence[fileContent[i]] += 1;
    }
    catch (Exception e) //  comma, colon and semi-colon are spaces
    {
        if (fileContent[i] == ' ' || fileContent[i] == ',' || fileContent[i] == ':' || fileContent[i] == ';')
        {
            occurrence[' '] += 1;
            //new_file_content += ' ';
        }
        else continue;
    }
    totalFrequency++; // increment total frequency
}

Try this:

        string input = "test string here";
        Dictionary<char, int> charDict = new Dictionary<char, int>();
        foreach (char c in input.ToLower()) {
            if (c < 'a' || c > 'z') {
                // space, comma, colon and semicolon all count as a space; anything else is ignored
                if (c == ' ' || c == ',' || c == ':' || c == ';') {
                    charDict[' '] = charDict.ContainsKey(' ') ? charDict[' '] + 1 : 1;
                }
            } else {
                // the first occurrence starts the count at 1
                charDict[c] = charDict.ContainsKey(c) ? charDict[c] + 1 : 1;
            }
        }

Since your loop iterates over a large number of characters, you want to minimize the checks inside the loop and remove the catch, as pointed out in the comments. There should never be a reason to control program flow with a try/catch block. I assume you initialize the dictionary first so that every counted character starts at 0; otherwise you have to add the character to the dictionary when it is not there yet. In the loop you can test the character with something like char.IsLetter() or with other checks, as D. Stewart suggests. I would not call ToLower() on the large string if you are going to iterate over every character anyway (that iterates over the string twice). You can do that conversion inside the loop if needed.
Try something like the code below. You could also initialize the dictionary with all 256 character values from 0 to 255 (the single-byte range; a C# char can hold more values than that), remove the if statement completely, and then, once the counting is complete, remove the entries you don't care about and add the counts for comma, colon and semicolon to the space entry.

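foreach (char c in fileContent)
{
    // Assumes occurrence already has an entry for every letter that can occur;
    // the indexer throws KeyNotFoundException for a missing key.
    if (char.IsLetter(c))
    {
        occurrence[c] += 1;
    }
    else if (c == ' ' || c == ',' || c == ':' || c == ';')
    {
        // comma, colon and semicolon are counted as spaces
        occurrence[' '] += 1;
    }
}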
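
The remark about doing the lowercase conversion inside the loop could look roughly like the sketch below. The class name PerCharLowercase and the method Count are made up for illustration, and the code assumes that only the ASCII letters A-Z should be folded into a-z (which is narrower than what string.ToLower() does).

```csharp
using System;
using System.Collections.Generic;

static class PerCharLowercase
{
    // Hypothetical helper: counts a-z (case-insensitively) and space-like characters in one pass.
    public static Dictionary<char, int> Count(string text)
    {
        // Pre-fill the dictionary so the loop never touches a missing key.
        var occurrence = new Dictionary<char, int> { [' '] = 0 };
        for (char ch = 'a'; ch <= 'z'; ch++)
            occurrence[ch] = 0;

        foreach (char raw in text)
        {
            // Assumption: only ASCII uppercase letters are converted; everything else is left as-is.
            char c = (raw >= 'A' && raw <= 'Z') ? (char)(raw - 'A' + 'a') : raw;

            if (c >= 'a' && c <= 'z')
                occurrence[c] += 1;
            else if (c == ' ' || c == ',' || c == ':' || c == ';')
                occurrence[' '] += 1; // comma, colon and semicolon count as a space
        }
        return occurrence;
    }
}
```

This keeps a single pass over the string and avoids allocating a second, lowercased copy of it.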

You could initialize the entire dictionary in advance like this also:

for (int i = 0; i < 256; i++)
{
    occurrence.Add((char)i, 0);
}
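
Putting the pre-filled dictionary together with the "count everything, clean up afterwards" idea might look like the following sketch. The names FoldAfterCounting and Count are hypothetical, and the code assumes the input contains only characters below 256; any other character would make the indexer throw.

```csharp
using System;
using System.Collections.Generic;

static class FoldAfterCounting
{
    // Hypothetical helper: counts every character without an if in the hot loop,
    // then folds ',', ':' and ';' into ' ' and drops all entries except a-z and ' '.
    public static Dictionary<char, int> Count(string text)
    {
        var occurrence = new Dictionary<char, int>();
        for (int i = 0; i < 256; i++)
            occurrence.Add((char)i, 0);

        // Hot loop: no checks at all. Assumes every character is below 256,
        // otherwise the indexer throws KeyNotFoundException.
        foreach (char c in text)
            occurrence[c] += 1;

        // Add the counts of the space-like characters to the space entry.
        foreach (char c in new[] { ',', ':', ';' })
            occurrence[' '] += occurrence[c];

        // Remove everything that is neither a lowercase letter nor the space.
        for (int i = 0; i < 256; i++)
        {
            char c = (char)i;
            bool keep = c == ' ' || (c >= 'a' && c <= 'z');
            if (!keep)
                occurrence.Remove(c);
        }
        return occurrence;
    }
}
```

The cleanup touches only 256 entries, so its cost is negligible next to the pass over a large string.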

There are several issues with that code snippet (i <= max reads one position past the end of the string, dictionary entries are accessed without being initialized, etc.), but the performance bottleneck is the reliance on exceptions: throwing and catching exceptions is extremely slow, especially when done inside a loop.
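
To get a feeling for the cost of exceptions on your own machine, a rough comparison along these lines may help. It is only a sketch: ExceptionCostDemo and its methods are invented names, the input of 20,000 '#' characters is arbitrary, and the timings will vary a lot (attaching a debugger makes the exception path far slower still).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class ExceptionCostDemo
{
    // Slow path: every character without a dictionary entry throws and is caught.
    public static int MissesWithTryCatch(Dictionary<char, int> occurrence, string text)
    {
        int misses = 0;
        foreach (char c in text)
        {
            try { occurrence[c] += 1; }
            catch (KeyNotFoundException) { misses++; }
        }
        return misses;
    }

    // Fast path: the same logic, but a missing key is an ordinary branch.
    public static int MissesWithTryGetValue(Dictionary<char, int> occurrence, string text)
    {
        int misses = 0;
        foreach (char c in text)
        {
            if (occurrence.TryGetValue(c, out int n)) occurrence[c] = n + 1;
            else misses++;
        }
        return misses;
    }

    // Times both variants on an input where every lookup misses.
    public static void Run()
    {
        string text = new string('#', 20000); // '#' is never a key here

        var sw = Stopwatch.StartNew();
        int a = MissesWithTryCatch(new Dictionary<char, int> { ['a'] = 0 }, text);
        Console.WriteLine($"try/catch:   {a} misses in {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        int b = MissesWithTryGetValue(new Dictionary<char, int> { ['a'] = 0 }, text);
        Console.WriteLine($"TryGetValue: {b} misses in {sw.ElapsedMilliseconds} ms");
    }
}
```

Calling ExceptionCostDemo.Run() from a console application prints both timings.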

I would start with putting the counts into a separate array.

Then I would either prepare a map from each char to its count index and use it inside the loop without any ifs:

var indexMap = new Dictionary<char, int>();
int charCount = 0;
// Map the valid characters to be counted
for (var ch = 'a'; ch <= 'z'; ch++)
    indexMap.Add(ch, charCount++);
// Map the "space" characters to be counted
foreach (var ch in new[] { ' ', ',', ':', ';' })
    indexMap.Add(ch, charCount);
charCount++;
// Allocate count array
var occurences = new int[charCount];
// Process the string
foreach (var ch in fileContent)
{
    int index;
    if (indexMap.TryGetValue(ch, out index))
        occurences[index]++;
}
// Not sure about this, but including it for consistency
totalFrequency = occurences.Sum();

or not use a dictionary at all:

// Allocate array for char counts
var occurences = new int['z' - 'a' + 1];
// Separate count for "space" chars
int spaceOccurences = 0;
// Process the string
foreach (var ch in fileContent)
{
    if ('a' <= ch && ch <= 'z')
        occurences[ch - 'a']++;
    else if (ch == ' ' || ch == ',' || ch == ':' || ch == ';')
        spaceOccurences++;
}
// Not sure about this, but including it for consistency
totalFrequency = spaceOccurences + occurences.Sum();

The former is more flexible (you can add more mappings); the latter is a bit faster. Both are fast enough, though: they complete in milliseconds for a string of about 1 MB.

OK, it's a little late, but this should be the fastest solution:

using System.Collections.Generic;
using System.Linq;

namespace ConsoleApplication99
{
  class Program
  {
    static void Main(string[] args)
    {
      string fileContent = "....."; // a large string 

      // --- high perf section to count all chars ---

      var charCounter = new int[char.MaxValue + 1];

      for (int i = 0; i < fileContent.Length; i++)
      {
        charCounter[fileContent[i]]++;
      }


      // --- combine results with linq (all actions consume less than 1 ms) ---

      var allResults = charCounter.Select((count, index) => new { count, charValue = (char)index }).Where(c => c.count > 0).ToArray();

      var spaceChars = new HashSet<char>(" ,:;");
      int countSpaces = allResults.Where(c => spaceChars.Contains(c.charValue)).Sum(c => c.count);

      var usefulChars = new HashSet<char>("abcdefghijklmnopqrstuvwxyz");
      int countLetters = allResults.Where(c => usefulChars.Contains(c.charValue)).Sum(c => c.count);
    }
  }
}

For very large text files, it's better to use a StreamReader ...
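
As a sketch of what the StreamReader variant could look like: the counting loop from above is applied to fixed-size chunks instead of one big string. StreamingCounter and CountChars are hypothetical names, and the 64K-character buffer size is an arbitrary choice rather than a tuned value.

```csharp
using System;
using System.IO;

static class StreamingCounter
{
    // Hypothetical helper: counts every character delivered by a TextReader
    // (for example a StreamReader) without loading the whole text into memory.
    public static int[] CountChars(TextReader reader)
    {
        var counts = new int[char.MaxValue + 1]; // one slot per possible char value
        var buffer = new char[64 * 1024];        // assumption: arbitrary chunk size
        int read;

        // Read returns the number of characters placed in the buffer, or 0 at the end.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
                counts[buffer[i]]++;
        }
        return counts;
    }
}
```

With a file, this would be called as `using (var reader = new StreamReader(path)) { var counts = StreamingCounter.CountChars(reader); }`, where path stands for your own file name; the space and letter totals can then be combined from counts in the same way as in the answer above.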


 