
Using a Regex to clean string versus Base64 Encoded string

I have an extension method that uses Regex.Replace to clean up invalid characters in a user-entered string before it is added to an XML document.

The intent of the regex is to strip out some random hi-ASCII characters that are occasionally in the input when the user pastes text from Microsoft Word and replace them with a space:

    public static string CleanInput(this string inputString) {
        if (string.IsNullOrEmpty(inputString))
            return string.Empty;

        // Replace invalid characters with a space.
        return Regex.Replace(inputString, @"[^\w\.@-]", " ");
    }

As fate would have it, someone is now using this extension method on a string that contains base64-encoded data.

I believe the regex will leave MOST of the base64 data unmodified, but I think it might be changing some of it.

So, knowing that \w in the regex matches [A-Za-z0-9_] and that Base64 uses effectively the same range, should this regex be changing the string or not?

If it is changing the string, why? And how would you change it so that hi-ASCII garbage is still cleaned up in regular non-encoded text without mucking up the encoded string?

Base64 also uses +, /, and =.

You can add these to your character class:

    [^\w\.@+/=-]

Note that - is placed last so that it is treated as a literal hyphen-minus instead of specifying a range (escaping it as \- would also work).
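As a rough sketch, here is the question's CleanInput with the extended character class dropped in. The StringExtensions and Demo class names and the sample bytes are made up for illustration; only the pattern change comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container class; the question does not show where CleanInput lives.
public static class StringExtensions
{
    public static string CleanInput(this string inputString)
    {
        if (string.IsNullOrEmpty(inputString))
            return string.Empty;

        // +, / and = are now allowed; the hyphen stays last so it is a literal.
        return Regex.Replace(inputString, @"[^\w\.@+/=-]", " ");
    }
}

public static class Demo
{
    public static void Main()
    {
        // These three bytes encode to "+/++", i.e. only characters the old pattern rejected.
        string encoded = Convert.ToBase64String(new byte[] { 0xfb, 0xff, 0xbe });

        // Original pattern from the question: every + and / is replaced by a space.
        string mangled = Regex.Replace(encoded, @"[^\w\.@-]", " ");
        Console.WriteLine(mangled == encoded);               // False

        // Extended pattern: the base64 text passes through unchanged.
        Console.WriteLine(encoded.CleanInput() == encoded);  // True
    }
}
```

This keeps a single method for both plain and encoded text, at the cost of also letting +, / and = through in ordinary user input.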

It may also be worth considering that, according to Microsoft's documentation, \w isn't necessarily the same as [A-Za-z0-9_]: by default in .NET it also matches Unicode letters and digits.
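If the ASCII-only meaning of \w is what you actually want, one option (not part of the original suggestion, so treat it as an assumption about your requirements) is RegexOptions.ECMAScript, which restricts \w to [a-zA-Z_0-9]. A small sketch comparing the two behaviors, with a made-up WordClassDemo class and sample string:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    public static void Main()
    {
        // "caf" followed by U+00E9 (e with acute accent), a non-ASCII letter.
        string input = "caf\u00e9";
        string pattern = @"[^\w\.@+/=-]";

        // Default .NET behavior: \w is Unicode-aware, so the accented letter is kept.
        string unicodeAware = Regex.Replace(input, pattern, " ");
        Console.WriteLine(unicodeAware == input);   // True

        // ECMAScript mode: \w is ASCII-only, so the accented letter becomes a space.
        string asciiOnly = Regex.Replace(input, pattern, " ", RegexOptions.ECMAScript);
        Console.WriteLine(asciiOnly == "caf ");     // True
    }
}
```

Whether stripping accented letters is desirable depends on the input you expect; for names and other non-English text the default Unicode-aware behavior is probably the safer choice.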
