
Efficiency of string.Split() vs. string.Substring() in C#?

I'm working on a project that involves taking large text files and parsing each line. The point is to parse the whole text file into cells, much like an Excel spreadsheet. Unfortunately, most of the files have no delimiters, so I need some sort of index-based method to create the cells manually, even if a column is blank.

Previously, lines were parsed by splitting on null, which worked well. However, new data has made this method unreliable because it does not produce blank cells, so I had to write a new method for parsing lines, which uses Substring. The method takes an array of integer indices and splits the string at the given indices:

private string[] SetCols3(int[] fixedWidthValues, string line)
{
    // One cell per column start index; cells that start past the end of the line stay null.
    string[] cols = new string[fixedWidthValues.Length];

    int columnLength;
    int FWV;   // start index of the current column
    int FWV2;  // start index of the next column, or line.Length if there is none

    bool lastOfFWV;
    bool outOfBounds;

    for (int x = 0; x < fixedWidthValues.Length; x++)
    {
        FWV = fixedWidthValues[x];
        lastOfFWV = x + 1 >= fixedWidthValues.Length;
        outOfBounds = lastOfFWV ? true : fixedWidthValues[x + 1] >= line.Length;
        FWV2 = lastOfFWV || outOfBounds ? line.Length : fixedWidthValues[x + 1];
        columnLength = FWV2 - FWV;
        columnLength *= columnLength < 0 ? -1 : 1;

        if (FWV < line.Length)
        {
            cols[x] = line.Substring(FWV, columnLength).Trim();
        }
    }

    return cols;
}

Quick breakdown of the code: the integers and booleans are there to handle blank columns, lines that are shorter than normal, and so on, and to make the code easier for other people to follow (as opposed to one long, convoluted if statement).

My question: is there a way to make this more efficient? For some reason, this method takes significantly longer than the previous one. I understand that it does more, so I expected it to take more time. However, the difference is surprisingly large. One call (with 15 indices) takes around 0.07 seconds, which is huge considering the method gets called several thousand times per file, compared to 0.00002 seconds at the high end for the method that splits on null. Is there something I can change in my code to noticeably improve its efficiency? I haven't been able to find anything particularly useful after hours of searching online.

Also, the number of indices/columns greatly affects the speed. For 15 columns, it takes around 0.07 seconds, compared to 0.05 seconds for 10 columns.

First,

outOfBounds = lastOfFWV ? true : fixedWidthValues[x + 1] >= line.Length;

could be changed to

outOfBounds = lastOfFWV || fixedWidthValues[x + 1] >= line.Length;

Next,

columnLength = FWV2 - FWV;
columnLength *= columnLength < 0 ? -1 : 1;

could be changed to

columnLength = Math.Abs(FWV2 - FWV);

And last,

if (FWV < line.Length)
{

could be moved to just after the FWV assignment at the top of the loop and changed to the following, with the condition inverted so that columns starting past the end of the line are skipped:

if (FWV >= line.Length)
    continue;

But I don't think any of these changes would have a significant impact on speed. Changing what is passed in might have more impact. Instead of passing in the column starting positions and recalculating the column widths for every line, even though they never change, pass in both the starting positions and the column widths. That way the widths do not have to be recalculated for each line.
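The idea of passing in precomputed positions and widths could be sketched roughly as follows, assuming the start positions are sorted in ascending order and are the same for every line in a file. ColumnSpec, FixedWidthParser, BuildSpecs, and SliceLine are hypothetical names that do not appear in the original code, and the last column is assumed to run to the end of the line.

```csharp
using System;

// Hypothetical helper (not from the original post): a column's start index and
// nominal width, computed once per file instead of once per line.
public struct ColumnSpec
{
    public readonly int Start;
    public readonly int Width; // -1 means "runs to the end of the line"

    public ColumnSpec(int start, int width)
    {
        Start = start;
        Width = width;
    }
}

public static class FixedWidthParser
{
    // Assumption: starts is sorted ascending, as in the question's usage.
    public static ColumnSpec[] BuildSpecs(int[] starts)
    {
        var specs = new ColumnSpec[starts.Length];
        for (int i = 0; i < starts.Length; i++)
        {
            int width = i + 1 < starts.Length ? starts[i + 1] - starts[i] : -1;
            specs[i] = new ColumnSpec(starts[i], width);
        }
        return specs;
    }

    public static string[] SliceLine(ColumnSpec[] specs, string line)
    {
        var cols = new string[specs.Length];
        for (int i = 0; i < specs.Length; i++)
        {
            int start = specs[i].Start;
            if (start >= line.Length)
                continue; // short line: leave the cell null, as SetCols3 does

            // Clamp the width so a short line never reads past its end.
            int available = line.Length - start;
            int width = specs[i].Width < 0 ? available : Math.Min(specs[i].Width, available);
            cols[i] = line.Substring(start, width).Trim();
        }
        return cols;
    }
}
```

With this shape, BuildSpecs would be called once per file and SliceLine once per line, so the per-line work is reduced to a bounds check, a clamp, and the Substring call.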

But rather than guessing, it would be best to profile the method to find the hot spot(s).
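As a rough first pass before reaching for a full profiler, a small harness built on System.Diagnostics.Stopwatch can show the average cost per call. This is only a sketch: Timing and AverageMicroseconds are hypothetical names, and a dedicated profiler or benchmarking library would give more reliable numbers.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action `iterations` times and returns the mean cost per call
    // in microseconds. Hypothetical helper, not part of the original post.
    public static double AverageMicroseconds(Action action, int iterations)
    {
        action(); // warm-up call so JIT compilation is not included in the timing

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds * 1000.0 / iterations;
    }
}
```

Timing a call such as SetCols3(indices, line) over several thousand iterations, then timing it again with one statement removed or replaced, is one way to narrow down which statement dominates.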

The issue was two stray .ToInt32() calls that I had accidentally included (I don't know why they were there). This was not Convert.ToInt32() but a separate method from my company's code, and for some reason it was extremely inefficient at converting numbers. For reference, the issue was on the following lines:

FWV = fixedWidthValues[x].ToInt32();
...
FWV2 = lastOfFWV || outOfBounds ? line.Length : fixedWidthValues[x + 1].ToInt32();

Removing them made the method 60 times faster.
