Using a Regex to clean string versus Base64 Encoded string
I have an extension method that uses Regex.Replace to clean up invalid characters in a user-entered string before it is added to an XML document.
The intent of the regex is to strip out some random hi-ASCII characters that occasionally appear in the input when the user pastes text from Microsoft Word, and to replace them with a space:
public static string CleanInput(this string inputString) {
    if (string.IsNullOrEmpty(inputString))
        return string.Empty;
    // Replace invalid characters with a space.
    return Regex.Replace(inputString, @"[^\w\.@-]", " ");
}
Now, as fate would have it, someone is using this extension method on a string that contains base64-encoded data.
I believe the regex will leave MOST of the base64 data unmodified; however, I think it might be changing some of it.
So, knowing that \w in the regex matches [A-Za-z0-9_] and that Base64 uses effectively the same range, should this regex be changing the string or not?
If it is changing the string, why? And how would you change the regex so that hi-ASCII garbage is still cleaned up in regular non-encoded text without mucking up the encoded string?
Base64 also uses +, /, and =.
You can add these to your character class:
[^\w\.@+/=-]
Note that - is placed last so that it is a literal hyphen-minus instead of specifying a range (placing it first or escaping it as \- would also work).
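As a rough sketch of the suggestion above, the following compares the original pattern with the widened one on a small base64 sample. The names CleanInputKeepingBase64, StringCleaningExtensions, and Demo, as well as the sample bytes, are made up for illustration and are not from the original post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StringCleaningExtensions
{
    // Hypothetical variant of CleanInput: '+', '/' and '=' are added to the
    // allowed set, and '-' stays last so it is read as a literal.
    public static string CleanInputKeepingBase64(this string inputString)
    {
        if (string.IsNullOrEmpty(inputString))
            return string.Empty;

        return Regex.Replace(inputString, @"[^\w\.@+/=-]", " ");
    }
}

public static class Demo
{
    public static void Main()
    {
        // Sample bytes chosen so the encoded text contains '+', '/' and '='.
        string encoded = Convert.ToBase64String(new byte[] { 0xFB, 0xFF, 0xFE, 0x3E });

        // Original pattern: '+', '/' and '=' are outside the allowed set,
        // so they are replaced with spaces and the payload no longer decodes.
        string damaged = Regex.Replace(encoded, @"[^\w\.@-]", " ");

        // Widened pattern: the base64 text passes through unchanged.
        string preserved = encoded.CleanInputKeepingBase64();

        Console.WriteLine("encoded:   [" + encoded + "]");
        Console.WriteLine("original:  [" + damaged + "]");
        Console.WriteLine("widened:   [" + preserved + "]");
        Console.WriteLine("unchanged: " + (preserved == encoded));
    }
}
```

One trade-off to keep in mind: with the widened class, +, / and = are also left alone in ordinary non-encoded text, which may or may not be acceptable depending on what the XML consumer expects.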
It may also be worth considering that \w isn't necessarily the same as [A-Za-z0-9_], according to Microsoft. In .NET, \w is Unicode-aware by default, so it also matches letters and digits outside the ASCII range.
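To illustrate the point about \w, here is a minimal sketch of three ways the same cleanup could behave on accented text. The class name WordCharDemo and the sample string are assumptions for illustration; RegexOptions.ECMAScript is the documented .NET option that restricts \w to ASCII word characters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCharDemo
{
    public static void Main()
    {
        string input = "r\u00E9sum\u00E9"; // "résumé"

        // Default .NET behavior: \w is Unicode-aware, so the accented
        // letters count as word characters and are kept.
        string unicodeAware = Regex.Replace(input, @"[^\w\.@+/=-]", " ");

        // RegexOptions.ECMAScript limits \w to [a-zA-Z_0-9], so the
        // accented letters are replaced with spaces.
        string ecmaScript = Regex.Replace(input, @"[^\w\.@+/=-]", " ", RegexOptions.ECMAScript);

        // Spelling the ASCII range out explicitly has the same effect
        // without relying on an option.
        string explicitAscii = Regex.Replace(input, @"[^A-Za-z0-9_\.@+/=-]", " ");

        Console.WriteLine("[" + unicodeAware + "]");
        Console.WriteLine("[" + ecmaScript + "]");
        Console.WriteLine("[" + explicitAscii + "]");
    }
}
```

Which variant fits depends on whether non-ASCII letters in user input should survive the cleanup; the original post only says that stray hi-ASCII characters from Word should be removed.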