
C# read file content code optimization

I have a large string that was read from a text file (e.g. a 1 MB text file), and I want to process it. Processing the string takes nearly 10 minutes.

Basically, the string is read character by character and the counter for each character is incremented by one. Some characters (space, comma, colon, and semicolon) are counted as a space, so the space counter is incremented for them. All other characters are ignored.

Code:

string fileContent = "....." // a large string 
int min = 0;
int max = fileContent.Length;
Dictionary<char, int> occurrence  // example c=>0, m=>4, r=>8 etc....

//  Note: occurrence has only the letters a-z and a space. Comma, colon and semi-colon are counted as space, and all other characters are ignored.

for (int i = min; i <= max; i++) // run loop to end 
{
    try // increment counter for alphabets and space
    {
        occurrence[fileContent[i]] += 1;
    }
    catch (Exception e) //  comma, colon and semi-colon are spaces
    {
        if (fileContent[i] == ' ' || fileContent[i] == ',' || fileContent[i] == ':' || fileContent[i] == ';')
        {
            occurrence[' '] += 1;
            //new_file_content += ' ';
        }
        else continue;
    }
    totalFrequency++; // increment total frequency
}

Try this:

        string input = "test string here";
        Dictionary<char, int> charDict = new Dictionary<char, int>();
        foreach(char c in input.ToLower()) {
            if(c < 'a' || c > 'z') {
                // Not a lowercase ASCII letter: count separators as a space, ignore the rest
                if(c == ' ' || c == ',' || c == ':' || c == ';') {
                    charDict[' '] = charDict.ContainsKey(' ') ? charDict[' '] + 1 : 1;
                }
            } else {
                // First occurrence starts at 1, later ones add 1 to the stored count
                charDict[c] = charDict.ContainsKey(c) ? charDict[c] + 1 : 1;
            }
        }

Because your loop iterates over a large number of characters, you want to minimize the checks inside the loop and remove the catch, as pointed out in the comments. There should never be a reason to control flow logic with a try/catch block. I assume you initialize the dictionary first to set the occurrence counts to 0; otherwise you have to add the character to the dictionary when it is not there yet. In the loop you can test the character with something like char.IsLetter(), or with other checks as D. Stewart suggests. I would not call ToLower on the large string if you are going to iterate over every character anyway (that iterates over the string twice). You can do that conversion in the loop if needed.
Try something like the code below. You could also initialize all 256 possible characters in the dictionary (this assumes the text contains only character codes 0 to 255; a C# char can hold 65,536 values), remove the if statement completely, and then, after the counting is complete, remove the items you don't care about and fold the comma, colon, and semicolon counts into the space entry.

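foreach (char c in fileContent)
{
    // Assumes occurrence already has an entry for every letter that can appear
    if (char.IsLetter(c))
    {
        occurrence[c] += 1;
    }
    else if (c == ' ' || c == ',' || c == ':' || c == ';')
    {
        // Space-like characters are all counted under ' '
        occurrence[' '] += 1;
    }
}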

You could also initialize the entire dictionary in advance like this:

for (int i = 0; i < 256; i++)
{
    occurrence.Add((char)i, 0);
}
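
The suggestions in this answer (a pre-initialized dictionary, no try/catch, lower-casing inside the loop) can be combined roughly as follows. This is only a sketch: the names CharCounter and Count are hypothetical, and it assumes that only a-z and the four space-like characters matter. It uses TryGetValue instead of char.IsLetter so that letters outside a-z are skipped rather than causing a missing-key error.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original question or answers.
public static class CharCounter
{
    public static Dictionary<char, int> Count(string text)
    {
        // Pre-initialize every key that will be counted so the loop never adds entries.
        var occurrence = new Dictionary<char, int>();
        for (char c = 'a'; c <= 'z'; c++)
        {
            occurrence[c] = 0;
        }
        occurrence[' '] = 0;

        foreach (char raw in text)
        {
            // Lower-case one character at a time instead of copying the whole string.
            char c = char.ToLowerInvariant(raw);

            if (c == ',' || c == ':' || c == ';')
            {
                c = ' '; // separators are counted as a space
            }

            // TryGetValue skips untracked characters without throwing an exception.
            int current;
            if (occurrence.TryGetValue(c, out current))
            {
                occurrence[c] = current + 1;
            }
        }
        return occurrence;
    }
}
```

For example, CharCounter.Count("Hello, World") would report 3 for 'l', 2 for 'o', and 2 for ' ' (the comma and the space).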

There are several issues with that code snippet (i <= max, accessing a dictionary entry without it being initialized, etc.), but the performance bottleneck is of course the reliance on exceptions, since throwing and catching exceptions is extremely slow (especially when done in an inner loop).

I would start by putting the counts into a separate array.

Then I would either prepare a map from each char to its count index and use it inside the loop, without any chain of character comparisons:

var indexMap = new Dictionary<char, int>();
int charCount = 0;
// Map the valid characters to be counted
for (var ch = 'a'; ch <= 'z'; ch++)
    indexMap.Add(ch, charCount++);
// Map the "space" characters to be counted
foreach (var ch in new[] { ' ', ',', ':', ';' })
    indexMap.Add(ch, charCount);
charCount++;
// Allocate count array
var occurences = new int[charCount];
// Process the string
foreach (var ch in fileContent)
{
    int index;
    if (indexMap.TryGetValue(ch, out index))
        occurences[index]++;
}
// Not sure about this, but including it for consistency
totalFrequency = occurences.Sum();

or not use a dictionary at all:

// Allocate array for char counts
var occurences = new int['z' - 'a' + 1];
// Separate count for "space" chars
int spaceOccurences = 0;
// Process the string
foreach (var ch in fileContent)
{
    if ('a' <= ch && ch <= 'z')
        occurences[ch - 'a']++;
    else if (ch == ' ' || ch == ',' || ch == ':' || ch == ';')
        spaceOccurences++;
}
// Not sure about this, but including it for consistency
totalFrequency = spaceOccurences + occurences.Sum();

The former is more flexible (you can add more mappings); the latter is a bit faster. But both are fast enough (they complete in milliseconds for a string of 1M characters).
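
To check the timing claim on your own machine, a rough Stopwatch sketch such as the one below could be used. The names CountBenchmark, CountWithArray, BuildSample, and MeasureMilliseconds are hypothetical, the input is synthetic, and the result depends on hardware and runtime, so treat any number it reports as indicative only.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical timing harness for the array-based variant.
public static class CountBenchmark
{
    // Indexes 0..25 hold 'a'..'z'; index 26 holds the space-like characters.
    public static int[] CountWithArray(string text)
    {
        var counts = new int[27];
        foreach (char ch in text)
        {
            if ('a' <= ch && ch <= 'z')
                counts[ch - 'a']++;
            else if (ch == ' ' || ch == ',' || ch == ':' || ch == ';')
                counts[26]++;
        }
        return counts;
    }

    // Builds a synthetic input of exactly the requested length (real files will differ).
    public static string BuildSample(int length)
    {
        const string pattern = "the quick brown fox, jumps: over; the lazy dog ";
        var sb = new StringBuilder(length);
        while (sb.Length < length)
            sb.Append(pattern);
        return sb.ToString(0, length);
    }

    // Returns the elapsed wall-clock time of one counting pass, in milliseconds.
    public static long MeasureMilliseconds(string text)
    {
        var watch = Stopwatch.StartNew();
        CountWithArray(text);
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

A call such as CountBenchmark.MeasureMilliseconds(CountBenchmark.BuildSample(1000000)) times a single pass over one million characters. A single run includes JIT warm-up, so repeating the measurement a few times gives a fairer picture.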

OK, it's a little late, but this should be the fastest solution:

using System.Collections.Generic;
using System.Linq;

namespace ConsoleApplication99
{
  class Program
  {
    static void Main(string[] args)
    {
      string fileContent = "....."; // a large string 

      // --- high perf section to count all chars ---

      var charCounter = new int[char.MaxValue + 1];

      for (int i = 0; i < fileContent.Length; i++)
      {
        charCounter[fileContent[i]]++;
      }


      // --- combine results with linq (all actions consume less than 1 ms) ---

      var allResults = charCounter.Select((count, index) => new { count, charValue = (char)index }).Where(c => c.count > 0).ToArray();

      var spaceChars = new HashSet<char>(" ,:;");
      int countSpaces = allResults.Where(c => spaceChars.Contains(c.charValue)).Sum(c => c.count);

      var usefulChars = new HashSet<char>("abcdefghijklmnopqrstuvwxyz");
      int countLetters = allResults.Where(c => usefulChars.Contains(c.charValue)).Sum(c => c.count);
    }
  }
}

For very large text files, it's better to use a StreamReader ...
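
As a sketch of what that could look like, the following reads the text in fixed-size chunks instead of loading the whole file into one string. The names StreamingCounter, Count, and CountFile are hypothetical, and the 4096-character buffer size is an arbitrary assumption.

```csharp
using System;
using System.IO;

// Hypothetical helper: counts characters from any TextReader in fixed-size chunks.
public static class StreamingCounter
{
    public static int[] Count(TextReader reader)
    {
        // One slot per possible char value, as in the array-based answer.
        var counts = new int[char.MaxValue + 1];
        var buffer = new char[4096]; // chunk size is an arbitrary assumption
        int read;

        // Read returns the number of characters placed in the buffer, or 0 at the end.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                counts[buffer[i]]++;
            }
        }
        return counts;
    }

    public static int[] CountFile(string path)
    {
        // StreamReader decodes the file incrementally, so memory use stays bounded.
        using (var reader = new StreamReader(path))
        {
            return Count(reader);
        }
    }
}
```

The resulting array can then be combined in the same way as charCounter in the answer, for example by summing the entries for ' ', ',', ':' and ';'.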
