
String operation optimisation in C#

The following C# code takes 5 minutes to run:

int i = 1;
string fraction = "";
while (fraction.Length < 1000000)
{
    fraction += i.ToString();
    i++;
}

"Optimising" it like this makes it run in 1.5 seconds:

int i = 1;
string fraction = "";
while (fraction.Length < 1000000)
{
    // concatenating strings is much faster for small strings
    string tmp = "";
    for (int j = 0; j < 1000; j++)
    {
        tmp += i.ToString();
        i++;
    }
    fraction += tmp;
}

EDIT: Some people suggested using StringBuilder, which is also an excellent suggestion, and this comes out at 0.06s:

int i = 1;
StringBuilder fraction = new StringBuilder();
while (fraction.Length < 1000000)
{
    fraction.Append(i);
    i++;
}

Playing around to find the optimum value of j is a topic for another time, but why exactly does this non-obvious optimisation work so well? Also, on a related topic, I've heard it said that you should never use the + operator with strings, in favour of string.Format(). Is this true?

I don't get your results at all. On my box StringBuilder wins hands down. Could you post your full test program? Here's mine, with variants for your string concatenation optimisation, the "simple" StringBuilder one (with and without an explicit ToString() call), and StringBuilder with an initial capacity. I've increased the limit as it was going too fast on my box to be usefully measurable.

using System;
using System.Diagnostics;
using System.Text;

public class Test
{
    const int Limit = 4000000;

    static void Main()
    {
        Time(Concatenation, "Concat");
        Time(SimpleStringBuilder, "StringBuilder as in post");
        Time(SimpleStringBuilderNoToString, "StringBuilder calling Append(i)");
        Time(CapacityStringBuilder, "StringBuilder with appropriate capacity");
    }

    static void Time(Action action, string name)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        Console.WriteLine("{0}: {1}ms", name, sw.ElapsedMilliseconds);
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }

    static void Concatenation()
    {
        int i = 1;
        string fraction = "";
        while (fraction.Length < Limit)
        {
            // concatenating strings is much faster for small strings
            string tmp = "";
            for (int j = 0; j < 1000; j++)
            {
                tmp += i.ToString();
                i++;
            }
            fraction += tmp;            
        }
    }

    static void SimpleStringBuilder()
    {
        int i = 1;
        StringBuilder fraction = new StringBuilder();
        while (fraction.Length < Limit)
        {
            fraction.Append(i.ToString());
            i++;
        }
    }

    static void SimpleStringBuilderNoToString()
    {
        int i = 1;
        StringBuilder fraction = new StringBuilder();
        while (fraction.Length < Limit)
        {
            fraction.Append(i);
            i++;
        }
    }

    static void CapacityStringBuilder()
    {
        int i = 1;
        StringBuilder fraction = new StringBuilder(Limit + 10);
        while (fraction.Length < Limit)
        {
            fraction.Append(i);
            i++;
        }
    }
}

And the results:

Concat: 5879ms
StringBuilder as in post: 206ms
StringBuilder calling Append(i): 196ms
StringBuilder with appropriate capacity: 184ms

The reason your concatenation is faster than the very first solution is simple though: you're doing many "cheap" concatenations (where relatively little data is being copied each time) and relatively few "large" concatenations (of the whole string so far). In the original, every step would copy all of the data obtained so far, which is obviously more expensive.

Use StringBuilder for concatenating more than (approximately) 5 strings (results may vary slightly). Also, give the StringBuilder's constructor a hint on the expected maximum size.

[Update]: just commenting on your edit to the question. You can also increase StringBuilder's performance if you have an approximate (or exact) idea of the final size of the concatenated strings, because this will reduce the number of memory allocations it has to perform:

// e.g. reserve room for 10 million chars (about 20 MB, as each char takes 2 bytes)
StringBuilder fraction = new StringBuilder(10000000);
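One way to see what the capacity hint buys you is to count how often the builder's reported Capacity changes while it fills up. This is only a rough sketch: the class and method names are made up for the example, and the exact growth pattern is an implementation detail that differs between .NET versions.

```csharp
using System;
using System.Text;

public static class CapacityDemo
{
    // Fills the builder as in the question and counts how many times
    // its reported Capacity changes along the way.
    public static int CountCapacityChanges(StringBuilder sb, int limit)
    {
        int changes = 0;
        int lastCapacity = sb.Capacity;
        int i = 1;
        while (sb.Length < limit)
        {
            sb.Append(i);
            i++;
            if (sb.Capacity != lastCapacity)
            {
                changes++;
                lastCapacity = sb.Capacity;
            }
        }
        return changes;
    }

    public static void Main()
    {
        const int limit = 1000000;
        int withDefault = CountCapacityChanges(new StringBuilder(), limit);
        // The extra 10 chars leave room for the last number to overshoot the limit.
        int withHint = CountCapacityChanges(new StringBuilder(limit + 10), limit);
        Console.WriteLine("default capacity: {0} capacity changes", withDefault);
        Console.WriteLine("capacity hint:    {0} capacity changes", withHint);
    }
}
```

With the hint the builder should never need to grow, so the second count is expected to be zero; the first count depends on the runtime.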

You will probably see that the first 1000 chars take almost no time compared with the last 1000 chars.

I would assume that the time-consuming part is the actual copying of the large string into a new memory area every time you add a char; that is the tough work for your computer.

Your optimization can easily be compared to what you usually do with streams: you use a buffer. Larger chunks will usually result in better performance until you hit the critical size where it no longer makes any difference, and it starts to be a downside when you're handling small amounts of data.

If, however, you had defined a char array with the appropriate size from the beginning, it would probably be blazing fast, because then it wouldn't have to be copied over and over again.
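The char array idea could look roughly like the sketch below. The names CharBufferDemo and Build are invented for this illustration, and the code still allocates a small temporary string per number, so treat it as a demonstration of "one big buffer, one final string" rather than a tuned implementation.

```csharp
using System;

public static class CharBufferDemo
{
    // Writes 1, 2, 3, ... into one preallocated char[] until at least
    // `limit` chars are filled, then creates the result string once.
    public static string Build(int limit)
    {
        // An int needs at most 11 chars, so this slack covers the final overshoot.
        char[] buffer = new char[limit + 11];
        int length = 0;
        int i = 1;
        while (length < limit)
        {
            string digits = i.ToString();
            digits.CopyTo(0, buffer, length, digits.Length);
            length += digits.Length;
            i++;
        }
        return new string(buffer, 0, length);
    }

    public static void Main()
    {
        string fraction = Build(1000000);
        Console.WriteLine(fraction.Substring(0, 15)); // 123456789101112
        Console.WriteLine(fraction.Length);
    }
}
```

The large buffer is allocated once and never copied while it is being filled; the only full-size copy happens when the final string is constructed.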

Also, on a related topic, I've heard it said that you should never use the + operator with strings, in favour of string.Format(), is this true?

No, like all absolute statements it's nonsense. However, it is true that using Format usually makes formatting code more readable and it's often slightly faster than concatenation, but speed isn't the deciding factor here.
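To make the readability point concrete, here is the same message built both ways. The method names and the message text are invented for the example; both methods produce the same string, so the choice between them is mostly about which one is easier to read and maintain.

```csharp
using System;

public static class FormatDemo
{
    // The literal pieces and the values are interleaved.
    public static string WithConcat(string user, int count, string folder)
    {
        return user + " has " + count + " new messages in " + folder + ".";
    }

    // The sentence is visible as a whole, with numbered placeholders.
    public static string WithFormat(string user, int count, string folder)
    {
        return string.Format("{0} has {1} new messages in {2}.", user, count, folder);
    }

    public static void Main()
    {
        Console.WriteLine(WithConcat("Ann", 3, "Inbox")); // Ann has 3 new messages in Inbox.
        Console.WriteLine(WithFormat("Ann", 3, "Inbox")); // Ann has 3 new messages in Inbox.
    }
}
```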

As for your code: it results in smaller strings (namely, tmp) being copied in the concatenation. Of course, in fraction += tmp you copy a larger string, but this happens less often.

Therefore, you've reduced many large copies to a few large and many small copies.

Hmm, I've just noticed that your outer loop has the same size in both cases. This shouldn't be faster, then.

I can't do tests now, but try to use StringBuilder.

int i = 1;
StringBuilder fraction = new StringBuilder();
while (fraction.Length < 1000000)
{
    fraction.Append(i);
    i++;
}
// return the builder that was actually filled above
return fraction.ToString();

Answer to the modified question ("why does this non-obvious optimization work so well" and "is it true you shouldn't use + operator on strings"):

I'm not sure which non-obvious optimization you are talking about. But the answer to the second question, I think, covers all of the bases.

The way strings work in C# is that they are allocated as fixed-length, and cannot be changed. This means that any time you try to change the length of the string, an entire new string is created and the old string is copied in up to the proper length. This is obviously a slow process. When you use String.Format it internally uses a StringBuilder to create the string.
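The "entire new string is created" behaviour is easy to observe with two references to the same string. This is a minimal sketch; the class and method names are made up for the illustration.

```csharp
using System;

public static class ImmutabilityDemo
{
    // Returns true if appending to a string variable left the
    // previously referenced string object untouched.
    public static bool AppendCreatesNewObject()
    {
        string original = "abc";
        string alias = original;   // second reference to the same object
        original += "d";           // allocates a new string and rebinds `original`
        return alias == "abc" && !object.ReferenceEquals(alias, original);
    }

    public static void Main()
    {
        string original = "abc";
        string alias = original;
        original += "d";

        Console.WriteLine(alias);                                   // abc
        Console.WriteLine(original);                                // abcd
        Console.WriteLine(object.ReferenceEquals(alias, original)); // False
    }
}
```

The += operator never modifies the old object; it builds a new one and points the variable at it, which is why each append in the question's first loop copies everything accumulated so far.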

StringBuilders work by using a memory buffer which is more intelligently allocated than fixed-length strings, and thus perform significantly better in most situations. I'm not sure about the details of StringBuilder's internals, so you'll have to ask a new question for that. I can speculate it either doesn't reallocate the old portions of the string (instead creating a linked list internally and only actually allocating the final output when needed by ToString) or it reallocates with exponential growth (when it runs out of memory, it allocates twice as much the next time, thus for a 2GB string it would only need to reallocate about 30 times).
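The "about 30 times" estimate for exponential growth can be checked with a small model. This simulates a hypothetical buffer that doubles whenever it is full; it is not a description of what StringBuilder actually does, and the names and the starting capacity of 16 are assumptions made for the example.

```csharp
using System;

public static class GrowthDemo
{
    // Counts reallocations for a hypothetical buffer that doubles its
    // capacity whenever it is full, and the units copied while doing so.
    public static int CountDoublings(long startCapacity, long targetSize, out long copied)
    {
        int reallocations = 0;
        copied = 0;
        long capacity = startCapacity;
        while (capacity < targetSize)
        {
            copied += capacity; // the full old buffer is copied into the new one
            capacity *= 2;
            reallocations++;
        }
        return reallocations;
    }

    public static void Main()
    {
        long copied;
        int count = CountDoublings(16, 1L << 31, out copied);
        Console.WriteLine("{0} reallocations, {1} units copied", count, copied);
        // prints "27 reallocations, 2147483632 units copied"
    }
}
```

Starting from a capacity of 1 instead of 16 gives 31 doublings, which matches the rough figure above. Note also that the total amount copied stays below the final size, which is what makes doubling so much cheaper than copying the whole string on every append.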

Your example with the nested loops grows linearly. It takes a small string and grows that up to 1000, and then tacks that 1000 on to the larger string in one large operation. As the large string gets really large, the copy that results from creating a new string takes a long time. When you reduce the number of times this is done (by resizing a smaller string more often instead), you increase the speed. Of course, StringBuilder is even smarter about allocating memory, and thus is much faster.

Adding a character to a string can have two consequences:

  • if there is still space for the character it is just added at the end (as a commenter noticed, this cannot happen with C# strings, as they are immutable);
  • if there is no space at the end, a new block of memory is allocated for the new string, the contents of the old string are copied there and the character is added.

To analyse your code, it is simpler to consider adding a single character 1000000 times. Your exact example is a bit more complex to explain because for higher values of i you add more characters at a time.

Then in the situation where no extra space is reserved, the first example has to do 1000000 allocations and copies, of an average of 0.5 * 1000000 characters each. The second one has to do 1000 allocations and copies of an average of 0.5 * 1000000 characters, and 1000000 allocations and copies of an average of 0.5 * 1000 characters. If copying is linear in the size of the copy and allocation is free, the first situation takes 500 000 000 000 units of time and the second one 500 000 000 + 500 000 000 units of time.
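The estimate above can be reproduced with a small cost model that only counts copied characters. It uses the same simplifications as the analysis (one character per append, copying linear in size, allocation free); the class and method names are invented for this sketch.

```csharp
using System;

public static class CopyCostModel
{
    // One char appended at a time to a single string: every append
    // copies the whole new string (old contents plus the new char).
    public static long Naive(int n)
    {
        long total = 0;
        for (int length = 1; length <= n; length++)
        {
            total += length;
        }
        return total;
    }

    // Chars are first collected in a small tmp string of `chunk` chars,
    // which is then appended to the large string in one go.
    public static long Chunked(int n, int chunk)
    {
        long total = 0;
        int fractionLength = 0;
        while (fractionLength < n)
        {
            for (int tmpLength = 1; tmpLength <= chunk; tmpLength++)
            {
                total += tmpLength;      // cheap copies while building tmp
            }
            fractionLength += chunk;
            total += fractionLength;     // one expensive copy per chunk
        }
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(Naive(1000000));         // 500000500000
        Console.WriteLine(Chunked(1000000, 1000)); // 1001000000
    }
}
```

Under this model the chunked version copies roughly 500 times fewer characters. That is in the same ballpark as the measured difference between 5 minutes and 1.5 seconds, although real timings also include allocation and garbage collection costs.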


 