Better option for String Manipulation - .NET

I'm working with huge string data for a project in C#. I'm confused about which approach I should use to manipulate my string data.

First Approach:

StringBuilder myString = new StringBuilder().Append(' ', 1024);

while(someString[++counter] != someChar)
    myString[i++] = someString[counter];


Second Approach:

string myString;

int i = counter;
while(someString[++counter] != someChar);
myString = someString.Substring(i, counter - i);

Which of the two would be faster (and more efficient), considering that the strings I'm working with are huge?

The strings are already in RAM. The size of a string can vary from 32MB to 1GB.

You should use IndexOf rather than doing individual character manipulations in a loop, and add whole chunks of string to the result:

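StringBuilder myString = new StringBuilder();
int pos = someString.IndexOf(someChar, counter);
// Substring takes (startIndex, length), so pass the length, not the end position.
// IndexOf returns -1 when someChar is absent; check for that in real code.
myString.Append(someString.Substring(counter, pos - counter));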

For "huge" strings, it may make sense to take a streamed approach and not load the whole thing into memory. For the best raw performance, you can sometimes squeeze a little more speed out by using pointer math to search and capture pieces of strings.

To be clear, I'm stating two completely different approaches.

1 - Stream
The OP doesn't say how big these strings are, but it may be impractical to load them into memory. Perhaps they are being read from a file, from a data reader connected to a DB, from an active network connection, etc.

In this scenario, I would open a stream and read forward, buffering my input in a StringBuilder until the criteria were met.
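A minimal sketch of that streamed read, assuming the "criteria" is simply reaching a stop character. The class name StreamScan and the method ReadUntil are hypothetical, and a StringReader stands in for the file, database, or network reader:

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamScan
{
    // Reads characters from 'reader' until 'stop' is seen or the input ends.
    // The stop character is consumed but not returned.
    public static string ReadUntil(TextReader reader, char stop, StringBuilder buffer)
    {
        buffer.Clear(); // reuse the caller's builder instead of allocating a new one
        int next;
        while ((next = reader.Read()) != -1) // Read() returns -1 at end of input
        {
            char c = (char)next;
            if (c == stop)
                break;
            buffer.Append(c);
        }
        return buffer.ToString();
    }

    public static void Main()
    {
        // StringReader is only a stand-in here; any TextReader works the same way.
        using (TextReader reader = new StringReader("alpha;beta;gamma"))
        {
            var buffer = new StringBuilder(1024);
            Console.WriteLine(ReadUntil(reader, ';', buffer)); // alpha
            Console.WriteLine(ReadUntil(reader, ';', buffer)); // beta
        }
    }
}
```

Only the current piece is held in the builder, so memory use depends on the size of one match rather than the size of the whole input.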

2 - Unsafe Char Manipulation
This requires that you do have the complete string. You can obtain a char* to the start of a string quite simply:

// fix entire string in memory so that we can work w/ memory range safely
fixed( char* pStart = bigString ) 
{
    char* pChar = pStart; // unfixed pointer to start of string
    char* pEnd = pStart + bigString.Length;
}

You can now increment pChar and examine each character. You can buffer it (e.g. if you want to examine multiple adjacent characters) or not, as you choose. Once you determine the ending memory location, you have a range of data that you can work with.

Reference: Unsafe Code and Pointers in C#
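As a rough illustration of walking that pointer range, the sketch below scans forward to a stop character and copies the range into a new string. UnsafeScan and TakeUntil are hypothetical names, the single-character stop condition is an assumption, and the project must be compiled with unsafe code enabled (for example the AllowUnsafeBlocks project setting):

```csharp
using System;

public static class UnsafeScan
{
    // Returns the text from 'start' up to (not including) the first 'stop',
    // or up to the end of the string if 'stop' never occurs.
    public static unsafe string TakeUntil(string bigString, int start, char stop)
    {
        if (bigString == null)
            throw new ArgumentNullException(nameof(bigString));
        if (start < 0 || start > bigString.Length)
            throw new ArgumentOutOfRangeException(nameof(start));

        // Pin the string so the GC cannot move it while we hold raw pointers.
        fixed (char* pStart = bigString)
        {
            char* pChar = pStart + start;
            char* pEnd = pStart + bigString.Length;

            // Bounds are our responsibility: never read at or past pEnd.
            while (pChar < pEnd && *pChar != stop)
                pChar++;

            // new string(char*, int, int) copies the range into a managed string.
            int length = (int)(pChar - pStart) - start;
            return new string(pStart, start, length);
        }
    }

    public static void Main()
    {
        Console.WriteLine(TakeUntil("hello,world", 0, ',')); // hello
    }
}
```

The pointer is only read here, never written through, which keeps the original immutable string intact.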

2.1 - A Safer Approach

If you are familiar with unsafe code, it is very fast, expressive, and flexible. If not, I would still use a similar approach, but without the pointer math. This is similar to the approach which @supercat suggested, namely:

  • Get a char[].
  • Read through it character by character.
  • Buffer where needed. StringBuilder is good for this; set an initial size and reuse the instance.
  • Analyze buffer where needed.
  • Dump buffer often.
  • Do something with the buffer when it contains the desired match.
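The steps above can be sketched as follows. This is only one possible reading of the list: BufferedScan and CollectRuns are hypothetical names, and "the desired match" is assumed to be any run of characters that ends at a stop character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BufferedScan
{
    // Collects every run of characters that is terminated by 'stop'.
    public static List<string> CollectRuns(string source, char stop)
    {
        char[] chars = source.ToCharArray();    // get a char[]
        var buffer = new StringBuilder(256);    // initial size set once, instance reused
        var matches = new List<string>();

        foreach (char c in chars)               // read character by character
        {
            if (c == stop)
            {
                matches.Add(buffer.ToString()); // do something with the match
                buffer.Clear();                 // dump the buffer
            }
            else
            {
                buffer.Append(c);               // buffer where needed
            }
        }

        // Trailing text with no terminating 'stop' is discarded in this sketch.
        return matches;
    }

    public static void Main()
    {
        foreach (string run in CollectRuns("ab;cd;ef", ';'))
            Console.WriteLine(run); // ab, then cd
    }
}
```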

And an obligatory disclaimer for unsafe code: The vast majority of the time the framework methods are a better solution. They are safe, tested, and invoked millions of times per second. Unsafe code puts all of the responsibility on the developer. It does not make any assumptions; it's up to you to be a good framework/OS citizen (e.g. not overwriting immutable strings, not allowing buffer overruns, etc.). Because it does not make any assumptions and removes the safeguards, it will often yield a performance increase. It's up to the developer to determine if there is indeed a benefit, and to decide if the advantages are significant enough.

Per request from OP, here are my test results.

Assumptions:

  • Big string is already in memory, no requirement for reading from disk
  • Goal is to not use any native pointers/unsafe blocks
  • The "checking" process is simple enough that something like Regex is not needed. For now it is simplified to a single char comparison. The code below can easily be modified to consider multiple chars at once; this should have no effect on the relative performance of the two approaches.

public static void Main()
{
    string bigStr = GenString(100 * 1024 * 1024);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 10; i++)
    {
        int counter = -1;
        StringBuilder sb = new StringBuilder();
        while (bigStr[++counter] != 'x')
            sb.Append(bigStr[counter]);
        Console.WriteLine(sb.ToString().Length);
    }
    sw.Stop();
    Console.WriteLine("StringBuilder: {0}", sw.Elapsed.TotalSeconds);

    sw = Stopwatch.StartNew();
    for (int i = 0; i < 10; i++)
    {
        int counter = -1;
        while (bigStr[++counter] != 'x') ;
        Console.WriteLine(bigStr.Substring(0, counter).Length);
    }
    sw.Stop();
    Console.WriteLine("Substring: {0}", sw.Elapsed.TotalSeconds);
}

public static string GenString(int size)
{
    StringBuilder sb = new StringBuilder(size);
    for (int i = 0; i < size - 1; i++)
    {
        sb.Append('a');
    }
    sb.Append('x');
    return sb.ToString();
}

Results (release build, .NET 4):

StringBuilder ~7.9 sec

Substring ~1.9 sec

StringBuilder was consistently > 3x slower, with a variety of different sized strings.

There's an IndexOf operation which would search more quickly for someChar, but I'll assume your real function to find the desired length is more complicated than that. In that scenario, I would recommend copying someString to a Char[], doing the search, and then using the new String(Char[], Int32, Int32) constructor to produce the final string. Indexing a Char[] is going to be so much more efficient than indexing a String or StringBuilder that unless you expect that you'll typically be needing only a small fraction of the string, copying everything to the Char[] will be a 'win' (unless, of course, you could simply use something like IndexOf).

Even if the length of the string will often be much larger than the length of interest, you may still be best off using a Char[]. Pre-initialize the Char[] to some size, and then do something like:
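A minimal sketch of the copy-search-construct idea, under the assumption that the "more complicated" stop condition is a digit that directly follows a letter. That condition, and the names CharArrayScan and TakeUntilLetterThenDigit, are made up for illustration:

```csharp
using System;

public static class CharArrayScan
{
    // Returns the text from 'start' up to (not including) the first digit
    // that directly follows a letter, or to the end if no such digit exists.
    public static string TakeUntilLetterThenDigit(string someString, int start)
    {
        if (start < 0 || start > someString.Length)
            throw new ArgumentOutOfRangeException(nameof(start));

        char[] chars = someString.ToCharArray(); // one bulk copy up front

        int end = start;
        while (end < chars.Length)
        {
            // Hypothetical condition that a single IndexOf call cannot express.
            if (end > start && char.IsLetter(chars[end - 1]) && char.IsDigit(chars[end]))
                break;
            end++;
        }

        // The String(Char[], Int32, Int32) constructor builds the result in one copy.
        return new string(chars, start, end - start);
    }

    public static void Main()
    {
        Console.WriteLine(TakeUntilLetterThenDigit("12ab34", 0)); // 12ab
    }
}
```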

Char[] temp = new Char[1024];
int i=0;
while (i < theString.Length)
{
  int subLength = theString.Length - i;
  if (subLength > temp.Length)  // May impose other constraints on subLength, provided
    subLength = temp.Length;    // it's greater than zero.
  theString.CopyTo(i, temp, 0, subLength);
  // ... do stuff with the array
  i+=subLength;
}

Once you're all done, you may then use a single Substring call to construct a string with the necessary characters from the original. If your application requires building a string whose characters differ from the original, you could use a StringBuilder and, within the above loop, use the Append(Char[], Int32, Int32) method to add processed characters to it.

Note also that when using the above loop construct, one may decide to reduce subLength at any point in the loop, provided it is not reduced to zero. For example, if one is trying to find whether the string contains a prime number of sixteen or fewer digits enclosed by parentheses, one could start by scanning for an open-paren; if one finds it and it's possible that the data one is looking for might extend beyond the array, set subLength to the position of the open-paren, and reloop. Such an approach will result in a small amount of redundant copying, but not much (often none), and will eliminate the need to keep track of parsing state between loops. A very convenient pattern.
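The open-paren example might look roughly like the sketch below. It is an interpretation rather than the answerer's code: ChunkedScan and FindParenthesizedDigits are hypothetical names, the primality check is left out, and only ASCII digits are accepted.

```csharp
using System;

public static class ChunkedScan
{
    private const int MaxDigits = 16;

    // Returns the digits of the first "(digits)" group with 1 to 16 digits, or null.
    public static string FindParenthesizedDigits(string theString, int bufferSize = 1024)
    {
        // The buffer must be able to hold '(' + 16 digits + ')'.
        if (bufferSize < MaxDigits + 2)
            throw new ArgumentOutOfRangeException(nameof(bufferSize));

        char[] temp = new char[bufferSize];
        int i = 0;
        while (i < theString.Length)
        {
            int subLength = Math.Min(theString.Length - i, temp.Length);
            theString.CopyTo(i, temp, 0, subLength);
            bool atEnd = i + subLength == theString.Length;

            for (int p = 0; p < subLength; p++)
            {
                if (temp[p] != '(')
                    continue;

                // The group might run past the buffer: shrink subLength to the
                // paren's position so the next chunk starts exactly at the paren.
                // p > 0 guarantees subLength never drops to zero.
                if (p > 0 && !atEnd && p + MaxDigits + 2 > subLength)
                {
                    subLength = p;
                    break;
                }

                int q = p + 1;
                while (q < subLength && q - p <= MaxDigits && temp[q] >= '0' && temp[q] <= '9')
                    q++;

                if (q > p + 1 && q < subLength && temp[q] == ')')
                    return new string(temp, p + 1, q - p - 1);
            }
            i += subLength;
        }
        return null;
    }

    public static void Main()
    {
        string text = new string('a', 1020) + "(12345)"; // group straddles the first chunk
        Console.WriteLine(FindParenthesizedDigits(text)); // 12345
    }
}
```

At most MaxDigits + 1 characters are copied twice per restart, which is the small amount of redundant copying mentioned above.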

You always want to use StringBuilder when manipulating strings. This is because strings are immutable, so a new object needs to be created every time one is modified.
