Better string replacing in C#

Does anyone have an idea which approach would be better for potential string replacement?

I have a collection of varying size containing strings of varying lengths, some of which may need encoded hex values (e.g. =0A, %20, etc.) replaced.

The replacements (there could be several per string) would be handled by a regular expression that detects the escaped hex values.

Which would be more efficient?

  1. Simply run the replacement on every string in the collection, ensuring by brute force that all needed replacements are made.

  2. Test whether a replacement is needed and run the replacement only on the strings that need it.

I'm working in C#.

Update

A little additional info from the answers and comments.

This is primarily for VCARD processing, where the VCARD is loaded from a QR code.

I currently have a regex that uses capture groups to get the KEY, PARAMETERS and VALUE from each KEY;PARAMETERS:VALUE in the VCARD.

Since I'm supporting versions 2.1 and 3.0, whose encoding and line folding are VERY different, I need to know the version before I decode.

It doesn't make sense to me to run the entire regular expression JUST to get the version, apply the appropriate replace to the whole VCARD block of text, and THEN rerun the regular expression.

To me it makes more sense to load up my capture groups, then grab the version and do the appropriate decoding replacement on each match.
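The flow described above (match once, read the version from those matches, then decode each value) might look roughly like the sketch below. The line pattern, the group names KEY, PARAM and VALUE, the class name VCardSketch and both decoding branches are assumptions for illustration, not the asker's actual regex. Line unfolding and most of the 3.0 escaping rules are left out.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class VCardSketch
{
    // Hypothetical line pattern: KEY;PARAM;PARAM:VALUE (not the asker's regex).
    static readonly Regex Line = new Regex(
        @"^(?<KEY>[^;:\r\n]+)(?:;(?<PARAM>[^;:\r\n]+))*:(?<VALUE>[^\r\n]*)",
        RegexOptions.Multiline);

    // Quoted-printable style escapes such as =0A or =20.
    static readonly Regex QpHex = new Regex("=(?<HEX>[0-9A-Fa-f]{2})");

    static string Decode(Match m, string version)
    {
        string value = m.Groups["VALUE"].Value;
        if (version == "2.1")
        {
            // In 2.1, decode only values flagged as quoted-printable.
            bool quotedPrintable = m.Groups["PARAM"].Captures.Cast<Capture>()
                .Any(c => c.Value.IndexOf("QUOTED-PRINTABLE", StringComparison.OrdinalIgnoreCase) >= 0);
            return quotedPrintable
                ? QpHex.Replace(value, h => ((char)Convert.ToInt32(h.Groups["HEX"].Value, 16)).ToString())
                : value;
        }
        // Placeholder for 3.0: only two of its backslash escapes are handled.
        return value.Replace("\\n", "\n").Replace("\\,", ",");
    }

    public static (string Key, string Value)[] Parse(string vcard)
    {
        // Run the parser regex once and keep the matches.
        var matches = Line.Matches(vcard).Cast<Match>().ToList();

        // Read the version from the matches that were already captured.
        string version = matches
            .Where(m => m.Groups["KEY"].Value == "VERSION")
            .Select(m => m.Groups["VALUE"].Value)
            .FirstOrDefault();

        // Decode each captured value according to that version.
        return matches
            .Select(m => (m.Groups["KEY"].Value, Decode(m, version)))
            .ToArray();
    }
}
```

Calling VCardSketch.Parse on a small 2.1 card returns one key/value pair per line, with only the quoted-printable values decoded. The regex runs over the text a single time.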

If you just call Replace, it will be slightly slower when there is no match, because of the additional checks that Replace performs, e.g.:

if (replacement == null)
{
    throw new ArgumentNullException("replacement");
}

Regex.Replace does return the input unchanged if no matches are found, so there's no memory issue here:

Match match = regex.Match(input, startat);
if (!match.Success)
{
    return input;
}

When there is a match, regex.Match runs twice: once when you do the check yourself and again when Replace does it. That means check-then-replace will be slower in that case.

So your results will depend on:

  • Do you expect a lot of matches or a lot of misses?
  • When there are matches, does the cost of running Regex.Match twice outweigh the extra parameter checks? My guess is it probably will.
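For reference, the two strategies being compared might be sketched like this. The %XX pattern, the class name HexReplaceSketch and the method names are assumptions, since the asker's real decode regex is not shown.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexReplaceSketch
{
    // Assumed pattern for %XX escapes.
    static readonly Regex Hex = new Regex("%(?<HEX>[0-9A-Fa-f]{2})");

    // Turns one matched escape into the character it encodes.
    static string Decode(Match m) =>
        ((char)Convert.ToInt32(m.Groups["HEX"].Value, 16)).ToString();

    // Option 1: always call Replace and let it handle the no-match case.
    public static string ReplaceAlways(string input) =>
        Hex.Replace(input, Decode);

    // Option 2: test first. An input that does match is searched by IsMatch
    // up to the first hit and then searched again from the start by Replace.
    public static string CheckThenReplace(string input) =>
        Hex.IsMatch(input) ? Hex.Replace(input, Decode) : input;
}
```

Both methods return the same strings. Which one is faster on real data would have to be measured, since it depends on the ratio of matches to misses.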

You could use something along the lines of a very specialized lexer with lookahead checking, e.g.:

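// Requires: using System; using System.Text;
static string DecodePercentHex(string input)
{
    var outputBuffer = new StringBuilder(input.Length);
    int index = 0;
    int max = input.Length;
    while (index < max)
    {
        // Look ahead: '%' must be followed by two hex digits inside the string.
        if (input[index] == '%'
            && index + 2 < max
            && Uri.IsHexDigit(input[index + 1])
            && Uri.IsHexDigit(input[index + 2]))
        {
            // Parse the two digits as base 16, not as a decimal number.
            outputBuffer.Append((char)Convert.ToInt32(input.Substring(index + 1, 2), 16));
            index += 3;
        }
        else
        {
            outputBuffer.Append(input[index]);
            index++;
        }
    }
    return outputBuffer.ToString();
}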

If you go with string replacement, it may be better to use StringBuilder.Replace than string.Replace, because it will not create many temporary strings while replacing.
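A minimal sketch of that idea follows. The class and method names and the two escape sequences are examples only. StringBuilder.Replace works on literal strings or characters rather than patterns, so each known escape needs its own call.

```csharp
using System.Text;

public static class BuilderReplaceSketch
{
    // Replaces a couple of fixed escape sequences in place on one StringBuilder,
    // instead of allocating a new string for every string.Replace call.
    public static string DecodeKnownEscapes(string input)
    {
        var sb = new StringBuilder(input);
        sb.Replace("%20", " ");
        sb.Replace("=0A", "\n");
        return sb.ToString();
    }
}
```

This only suits a small, known set of sequences. Decoding arbitrary hex values would still need a regex or a hand-written scanner.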

(Posted on behalf of the question author.)

Taking some inspiration from some of the fine folks who chimed in, I managed to isolate and test the code in question.

In both cases I have a Parser Regex that handles breaking up each "line" of the vcard and a Decode Regex that handles capturing any encoded Hex numbers.

It occurred to me that regardless of my use of string.Replace or not I still had to depend on the Decode Regex to pick up the potential replacement hex codes.

I ran through several different scenarios to see if the numbers would change, including casting the Regex MatchCollection to a Dictionary to remove the complexity of the Match object, and projecting the Decode Regex matches into a collection of distinct, simple anonymous objects with an Old and a New string value for plain string.Replace calls.

In the end, no matter how I massaged the test using String.Replace, it came close but was always slower than letting the Decode Regex do its Replace thing.

The closest was about a 12% difference in speed.
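The string.Replace variant described above might have looked something like the following sketch. The =XX pattern and all names here are assumptions, since the test code itself was not posted.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StringReplaceSketch
{
    // Assumed pattern for =XX escapes; the real Decode Regex is not shown.
    static readonly Regex Hex = new Regex("=(?<HEX>[0-9A-Fa-f]{2})");

    public static string DecodeViaStringReplace(string input)
    {
        // Project the matches into distinct Old/New pairs.
        var pairs = Hex.Matches(input).Cast<Match>()
            .Select(m => new
            {
                Old = m.Value,
                New = ((char)Convert.ToInt32(m.Groups["HEX"].Value, 16)).ToString()
            })
            .Distinct()
            .ToList();

        // One full string.Replace pass, and one new string, per distinct escape.
        string result = input;
        foreach (var pair in pairs)
        {
            result = result.Replace(pair.Old, pair.New);
        }
        return result;
    }
}
```

Even in this form the regex still has to run to find the escapes, and every distinct escape adds another pass over the string. Sequential passes can also mis-decode input in which a decoded character forms a new escape, such as =3D followed by two hex digits.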

In the end, for those curious, this is what ended up as the winning block of code:

    var ParsedCollection = Parser.Matches(UnfoldedEncodeString).Cast<Match>().Select(m => new
    {
      Field = m.Groups["FIELD"].Value,
      Params = m.Groups["PARAM"].Captures.Cast<Capture>().Select(c => c.Value),
      Encoding = m.Groups["ENCODING"].Value,
      // Quoted-printable: replace each captured hex pair with the character it encodes.
      Content = m.Groups["ENCODING"].Value.FirstOrDefault() == 'Q'
        ? QuotePrintableDecodingParser.Replace(m.Groups["CONTENT"].Value, me => Convert.ToChar(Convert.ToInt32(me.Groups["HEX"].Value, 16)).ToString())
        : m.Groups["CONTENT"].Value,
      // Base64: decode to bytes; null when the field is not Base64-encoded.
      Base64Content = m.Groups["ENCODING"].Value.FirstOrDefault() == 'B'
        ? Convert.FromBase64String(m.Groups["CONTENT"].Value.Trim())
        : null
    });

This gives me everything I need in one shot: all the fields, their values, any parameters, and the two most common encodings decoded, all projected into a nicely packaged anonymous object.

On the plus side, it takes only a little over 1,000 nanoseconds to go from string to parsed and decoded anonymous object (thank goodness for LINQ and extension methods), based on 100,000 tests with a VCARD of around 4,000 characters.
