How to outperform this regex replacement?

After considerable measurement, I have identified a hotspot in one of our Windows services that I'd like to optimize. We are processing strings that may contain multiple consecutive spaces, and we'd like to reduce each run to a single space. We use a static compiled regex for this task:

private static readonly Regex 
    regex_select_all_multiple_whitespace_chars = 
        new Regex(@"\s+",RegexOptions.Compiled);

and then use it as follows:

var cleanString=
    regex_select_all_multiple_whitespace_chars.Replace(dirtyString.Trim(), " ");

This line is invoked several million times and is proving to be fairly expensive. I've tried to write something better, but I'm stumped. Given the fairly modest processing requirements of the regex, surely there's something faster. Could unsafe processing with pointers speed things up further?

Edit:

Thanks for the amazing set of responses to this question... most unexpected!

This is about three times faster:

private static string RemoveDuplicateSpaces(string text) {
  StringBuilder b = new StringBuilder(text.Length);
  bool space = false;
  foreach (char c in text) {
    if (c == ' ') {
      if (!space) b.Append(c);
      space = true;
    } else {
      b.Append(c);
      space = false;
    }
  }
  return b.ToString();
}

How about this...

public string RemoveMultiSpace(string test)
{
    var words = test.Split(new char[] { ' ' },
        StringSplitOptions.RemoveEmptyEntries);
    return string.Join(" ", words);
}

Test case run with NUnit. Times are in milliseconds, with a comma as the decimal separator:

Regex Test time: 338,8885
RemoveMultiSpace Test time: 78,9335

private static readonly Regex regex_select_all_multiple_whitespace_chars =
   new Regex(@"\s+", RegexOptions.Compiled);

[Test]
public void Test()
{
    string startString = "A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      ";
    string cleanString;
    Trace.WriteLine("Regex Test start");
    int count = 10000;
    Stopwatch timer = new Stopwatch();
    timer.Start();
    for (int i = 0; i < count; i++)
    {
        cleanString = regex_select_all_multiple_whitespace_chars.Replace(startString, " ");
    }
    var elapsed = timer.Elapsed;
    Trace.WriteLine("Regex Test end");
    Trace.WriteLine("Regex Test time: " + elapsed.TotalMilliseconds);

    Trace.WriteLine("RemoveMultiSpace Test start");
    timer = new Stopwatch();
    timer.Start();
    for (int i = 0; i < count; i++)
    {
        cleanString = RemoveMultiSpace(startString);
    }
    elapsed = timer.Elapsed;
    Trace.WriteLine("RemoveMultiSpace Test end");
    Trace.WriteLine("RemoveMultiSpace Test time: " + elapsed.TotalMilliseconds);
}

public string RemoveMultiSpace(string test)
{
    var words = test.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    return string.Join(" ", words);
}

Edit:
I ran some more tests and added Guffa's method "RemoveDuplicateSpaces", based on StringBuilder.
My conclusion is that the StringBuilder method is faster when there are a lot of spaces, but with fewer spaces the string split method is slightly faster.

Cleaning file with about 30000 lines, 10 iterations
RegEx time elapsed: 608,0623
RemoveMultiSpace time elapsed: 239,2049
RemoveDuplicateSpaces time elapsed: 307,2044

Cleaning string, 10000 iterations:
A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      A B  C   D    E     F      
RegEx time elapsed: 590,3626
RemoveMultiSpace time elapsed: 159,4547
RemoveDuplicateSpaces time elapsed: 137,6816

Cleaning string, 10000 iterations:
A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      A      B      C      D      E      F      
RegEx time elapsed: 290,5666
RemoveMultiSpace time elapsed: 64,6776
RemoveDuplicateSpaces time elapsed: 52,4732

Currently, you are replacing a single space with another single space. Try to match \s{2,} instead (or something similar, if you also want to replace single newlines and other characters).
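
A minimal sketch of this suggestion, assuming the goal is only to collapse runs of two or more whitespace characters. The class and method names are hypothetical. Note that the behaviour differs from \s+: a lone tab or newline is left as it is rather than being converted to a space.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceRuns
{
    // Matches only runs of two or more whitespace characters, so a single
    // space is no longer matched and rewritten as itself.
    private static readonly Regex MultipleWhitespace =
        new Regex(@"\s{2,}", RegexOptions.Compiled);

    public static string Collapse(string dirtyString)
    {
        return MultipleWhitespace.Replace(dirtyString.Trim(), " ");
    }
}
```

Whether this is measurably faster than \s+ depends on how many single spaces the input contains, so it is worth timing against real data.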

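Just a suggestion: if your data doesn't contain Unicode whitespace, then instead of \s+ use [ \r\n]+ or [ \n]+, or just " +" (a space followed by +) if there are only spaces. Basically, limit the pattern to the minimum character set.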
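
As a sketch of that suggestion, assuming the only whitespace in the input is the ordinary space character (U+0020), the pattern can be narrowed to that one character. The names here are hypothetical, and any speed-up over \s+ would need to be measured.

```csharp
using System.Text.RegularExpressions;

public static class SpaceOnlyRegex
{
    // " +" matches runs of literal spaces and nothing else; tabs, newlines
    // and Unicode whitespace are left untouched.
    private static readonly Regex Spaces =
        new Regex(" +", RegexOptions.Compiled);

    public static string Collapse(string dirtyString)
    {
        // Trim(' ') removes only leading and trailing spaces, matching the
        // narrowed character set used by the pattern.
        return Spaces.Replace(dirtyString.Trim(' '), " ");
    }
}
```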

You could simply not use regular expressions. For example:

private static string NormalizeWhitespace(string test)
{
    string trimmed = test.Trim();

    var sb = new StringBuilder(trimmed.Length);

    int i = 0;
    while (i < trimmed.Length)
    {
        if (trimmed[i] == ' ')
        {
            sb.Append(trimmed[i]);

            do { i++; } while (i < trimmed.Length && trimmed[i] == ' ');
        }

        sb.Append(trimmed[i]);

        i++;
    }

    return sb.ToString();
}

With this method and the following test bed:

private static readonly Regex MultipleWhitespaceRegex = new Regex(
    @"\s+", 
    RegexOptions.Compiled);

static void Main(string[] args)
{
    string test = "regex  select    all multiple     whitespace   chars";

    const int Iterations = 15000;

    var sw = new Stopwatch();

    sw.Start();
    for (int i = 0; i < Iterations; i++)
    {
        NormalizeWhitespace(test);
    }
    sw.Stop();
    Console.WriteLine("{0}ms", sw.ElapsedMilliseconds);

    sw.Reset();

    sw.Start();
    for (int i = 0; i < Iterations; i++)
    {
        MultipleWhitespaceRegex.Replace(test, " ");
    }
    sw.Stop();
    Console.WriteLine("{0}ms", sw.ElapsedMilliseconds);
}

I got the following results:

// NormalizeWhitespace - 27ms
// Regex - 132ms

Note that this was only tested with a very simple example. It could be further optimized by removing the call to String.Trim, and it is only provided to make the point that regular expressions are sometimes not the best answer.
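
One way the call to String.Trim might be removed is to fold the trimming into the loop itself. The following is only a sketch with hypothetical names, and it assumes that only the space character needs trimming (String.Trim with no arguments removes every kind of whitespace, so the results can differ for tabs or newlines at the ends).

```csharp
using System.Text;

public static class SinglePassNormalizer
{
    public static string NormalizeSpaces(string text)
    {
        var sb = new StringBuilder(text.Length);
        bool pendingSpace = false;

        foreach (char c in text)
        {
            if (c == ' ')
            {
                // Remember the space only if something was already written,
                // which drops leading spaces.
                pendingSpace = sb.Length > 0;
            }
            else
            {
                // Emit one space for the whole preceding run, then the char.
                if (pendingSpace)
                {
                    sb.Append(' ');
                    pendingSpace = false;
                }
                sb.Append(c);
            }
        }

        // A space still pending here is never written, so trailing spaces
        // are dropped as well.
        return sb.ToString();
    }
}
```

Deferring the space until the next non-space character is what makes the trailing trim free: no second pass or substring copy is needed.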

I'm curious how a straightforward implementation might perform:

    static string RemoveConsecutiveSpaces(string input)
    {
        bool whiteSpaceWritten = false;
        StringBuilder sbOutput = new StringBuilder(input.Length);

        foreach (Char c in input)
        {
            if (c == ' ')
            {
                if (!whiteSpaceWritten)
                {
                    whiteSpaceWritten = true;
                    sbOutput.Append(c);
                }
            }
            else
            {
                whiteSpaceWritten = false;
                sbOutput.Append(c);
            }
        }

        return sbOutput.ToString();
    }

As it is such a simple expression (replacing two or more spaces with a single space), get rid of the Regex object and hard-code the replacement yourself (in C++/CLI):

String ^text = "Some   text  to process";
bool spaces = false;
// make the following static and just clear it rather than reallocating it every time
System::Text::StringBuilder ^output = gcnew System::Text::StringBuilder;
for (int i = 0, l = text->Length ; i < l ; ++i)
{
  if (spaces)
  {
    if (text [i] != ' ')
    {
      output->Append (text [i]);
      spaces = false;
    }
  }
  else
  {
    output->Append (text [i]);
    if (text [i] == ' ')
    {
      spaces = true;
    }
  }
}
text = output->ToString ();
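
The comment about keeping the builder static could look roughly like this in C#. This is a sketch under assumptions: the names are hypothetical, and [ThreadStatic] is used on the guess that the service calls this from more than one thread, so each thread gets its own builder.

```csharp
using System;
using System.Text;

public static class ReusedBuilder
{
    // One builder per thread, so concurrent callers do not share state.
    [ThreadStatic]
    private static StringBuilder cachedBuilder;

    public static string RemoveDuplicateSpaces(string text)
    {
        StringBuilder output = cachedBuilder ?? (cachedBuilder = new StringBuilder());
        output.Clear();                      // drop old content, keep the buffer
        output.EnsureCapacity(text.Length);  // grow once if this input is longer

        bool spaces = false;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c != ' ')
            {
                output.Append(c);
                spaces = false;
            }
            else if (!spaces)
            {
                // First space of a run is kept, the rest are skipped.
                output.Append(c);
                spaces = true;
            }
        }
        return output.ToString();
    }
}
```

The cached builder keeps the capacity of the largest string it has seen, so this trades some retained memory for fewer allocations.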

Writing into a preallocated char array avoids the StringBuilder overhead and should be faster:

    public static string RemoveMultiSpace(string input)
    {
        var value = input;

        if (!string.IsNullOrEmpty(input))
        {
            var isSpace = false;
            var index = 0;
            var length = input.Length;
            var tempArray = new char[length];
            for (int i = 0; i < length; i++)
            {
                var symbol = input[i];
                if (symbol == ' ')
                {
                    if (!isSpace)
                    {
                        tempArray[index++] = symbol;
                    }
                    isSpace = true;
                }
                else
                {
                    tempArray[index++] = symbol;
                    isSpace = false;
                }
            }
            value = new string(tempArray, 0, index);
        }

        return value;
    }
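
The hand-written loops in these answers test only for ' ', whereas the original \s+ pattern also matches tabs, newlines and other whitespace. If that behaviour matters, a variant along the following lines might serve. It is a sketch with hypothetical names, and it has not been checked that char.IsWhiteSpace agrees with \s for every code point.

```csharp
public static class WhitespaceCollapser
{
    public static string CollapseWhitespace(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        // The output can never be longer than the input.
        var buffer = new char[input.Length];
        int index = 0;
        bool inWhitespace = false;

        foreach (char symbol in input)
        {
            if (char.IsWhiteSpace(symbol))
            {
                // Write a single space for the whole run, whatever it contains.
                if (!inWhitespace)
                {
                    buffer[index++] = ' ';
                }
                inWhitespace = true;
            }
            else
            {
                buffer[index++] = symbol;
                inWhitespace = false;
            }
        }

        return new string(buffer, 0, index);
    }
}
```

Like the array version it is based on, this does not trim the ends, so the caller would still need to handle leading and trailing whitespace.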
