Remove excessive whitespace in user input field
In my controller method for handling a (potentially hostile) user input field I have the following code:
string tmptext = comment.Replace(System.Environment.NewLine, "{break was here}"); //marks line breaks for later re-insertion
tmptext = Encoder.HtmlEncode(tmptext);
//other sanitizing goes in here
tmptext = tmptext.Replace("{break was here}", "<br />");
var regex = new Regex("(<br /><br />)\\1+");
tmptext = regex.Replace(tmptext, "$1");
My goal is to preserve line breaks for typical non-malicious use and to display user input as safe, HTML-encoded strings. I take the user input, find the newline characters, and place a delimiter at each line break. I then perform the HTML encoding and reinsert the breaks. (I will likely change this to reinserting paragraphs as p tags instead of br, but for now I'm using br.)
Now, actually inserting real HTML breaks opens me up to a subtle vulnerability: the Enter key. The regex.Replace code is there to stop a malicious user from just standing on the Enter key and filling the page with crap.
This fixes big floods of pure whitespace, but it still leaves me open to abuse such as entering one character, two line breaks, one character, two line breaks, all the way down the page.
My question is for a method of determining that this is abusive and failing it on validation. I'm scared that there might not be a simple procedural method to do it, and that it will instead need heuristic techniques or Bayesian filters. Hopefully, someone has an easier, better way.
EDIT: Perhaps I wasn't clear in the problem description. The regex handles seeing multiple line breaks in a row and converting them to just one or two. That problem is solved. The real problem is distinguishing legitimate text from a crap flood like this:
a
a
a
...imagine 1000 of these...
a
a
a
a
It sounds like you're tempted to try something "clever" with a regex, but IMO the simplest approach is to just loop through the characters of the string, copying them to a StringBuilder and filtering as you go. Any characters for which char.IsWhiteSpace() returns true are not copied. (If one of these is a newline, then insert a <br/> and don't allow any more <br/>'s to be added until you have hit a non-whitespace character.)
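A minimal sketch of that character loop, under one reading of the answer: each run of whitespace is collapsed to a single space, or to a single <br/> if the run contains a newline, and leading and trailing whitespace is dropped. The class and method names are hypothetical, and the input is assumed to be HTML-encoded already, since the method emits raw <br/> tags.

```csharp
using System;
using System.Text;

static class WhitespaceFilter
{
    // Hypothetical helper: collapses each whitespace run into one space,
    // or into one <br/> when the run contains a newline character.
    // Assumes the input has already been HTML-encoded.
    public static string Collapse(string encodedInput)
    {
        var sb = new StringBuilder();
        bool pendingSpace = false;
        bool pendingBreak = false;

        foreach (char c in encodedInput ?? string.Empty)
        {
            if (char.IsWhiteSpace(c))
            {
                // Remember what kind of whitespace was seen, but copy nothing yet.
                if (c == '\r' || c == '\n') pendingBreak = true;
                else pendingSpace = true;
                continue;
            }

            // Emit at most one separator before the next visible character,
            // and never at the very start of the output.
            if (sb.Length > 0)
            {
                if (pendingBreak) sb.Append("<br/>");
                else if (pendingSpace) sb.Append(' ');
            }
            pendingBreak = false;
            pendingSpace = false;
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

For example, WhitespaceFilter.Collapse("a \r\n\r\n  b   c\n") yields "a<br/>b c". Note that this only removes empty space; it does nothing about a flood of one-character lines.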
Edit
If you want to stop the user entering any old crap, give up now. You will never find a filtering method that a user can't get around in less than a minute, if they really want to.
You will be much better off putting a limit on the number of newlines, or the total number of characters, in the input.
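A minimal sketch of such a limit check, run during validation on the raw input. The class name and both limit values are assumptions for illustration, not recommendations.

```csharp
using System;
using System.Linq;

static class CommentLimits
{
    // Assumed limits for illustration only; tune them for your application.
    public const int MaxLength = 2000;
    public const int MaxNewlines = 40;

    // Returns true when the raw comment stays within both limits.
    public static bool IsWithinLimits(string comment)
    {
        if (comment == null) return true;
        if (comment.Length > MaxLength) return false;

        // Normalize \r\n and lone \r to \n so each line break is counted once.
        string normalized = comment.Replace("\r\n", "\n").Replace('\r', '\n');
        int newlines = normalized.Count(c => c == '\n');
        return newlines <= MaxNewlines;
    }
}
```

A flood of one-character lines fails this check once it exceeds MaxNewlines, regardless of what the characters are.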
Think of how much effort it will take to do something clever to sanitise "bad input", and then consider how likely it is that this will happen. Probably there is no point. Probably all the sanitisation you really need is to ensure the data is legal (not too large for your system to handle, all dangerous characters stripped or escaped, etc). (This is exactly why forums have human moderators who can filter the posts based on whatever criteria are appropriate.)
I would HttpUtility.HtmlEncode the string, then convert newline characters to <br/>.
HttpUtility.HtmlEncode(subject).Replace("\r\n", "<br/>").Replace("\r", "<br/>").Replace("\n", "<br/>");
Also, you should perform this logic when you are outputting to the user, not when saving to the database. The only validation I do for the database is to make sure the value is properly escaped (other than normal business rules, that is).
EDIT: To fix the actual problem, however, you can use Regex to replace multiple newlines with a single newline beforehand.
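// The input string is the first argument. The replacement is a regular "\n" literal because
// .NET replacement patterns do not process character escapes such as \n.
subject = Regex.Replace(subject, @"(\r\n|\r|\n)+", "\n", RegexOptions.Singleline);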
I'm not sure if you would need RegexOptions.Singleline. It only changes how . matches, and this pattern contains no ., so it is probably unnecessary.
Rather than attempting to replace the newlines with filtered text and then attempting to use regular expressions on that, why not sanitize your data before inserting the <br /> tags? Don't forget to sanitize the input with HttpUtility.HtmlEncode first.
In an attempt to take care of multiple short lines in a row, here's my best attempt:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class Program {
    static void Main() {
        // Arbitrary cutoff: lines of this length or shorter are joined together.
        const int Cutoff = 6;
        string input =
            "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome\r\n" +
            "unsanatized\r\nbreaks\r\nand\ra\nsh\nor\nt\r\n\na\na\na\na" +
            "\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na";
        input = (input ?? String.Empty).Trim(); // Don't forget to HtmlEncode it.

        StringBuilder temp = new StringBuilder();   // accumulates consecutive short lines
        List<string> result = new List<string>();   // final lines, joined with <br /> below

        // Split on any line-break character and drop the empty lines.
        var items = input.Split(
                new[] { '\r', '\n' },
                StringSplitOptions.RemoveEmptyEntries)
            .Select(i => new { i.Length, Value = i });

        foreach (var item in items) {
            if (item.Length > Cutoff) {
                // A long line: flush any pending short lines first, then keep it as-is.
                if (temp.Length > 0) {
                    result.Add(temp.ToString());
                    temp.Clear();
                }
                result.Add(item.Value);
                continue;
            }
            // A short line: append it to the current run, separated by a space.
            if (temp.Length > 0) { temp.Append(" "); }
            temp.Append(item.Value);
        }
        if (temp.Length > 0) {
            result.Add(temp.ToString());
        }
        Console.WriteLine(String.Join("<br />", result));
    }
}
Produces the following output:
thisisatest<br />string with some<br />unsanatized<br />breaks and a sh or t a a
a a a a a a a a a a a a a a a a a a a
I'm sure you've already come up with this solution, but unfortunately what you're asking for isn't very straightforward.
For those interested, here's my first attempt:
using System;
using System.Text.RegularExpressions;

class Program {
    static void Main() {
        string input = "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome" +
            "\r\nunsanatized\r\nbreaks\r\n\r\n";
        // Trim the ends and drop carriage returns so only \n line breaks remain.
        input = (input ?? String.Empty).Trim().Replace("\r", String.Empty);
        // Collapse each run of newlines into a single <br />.
        string output = Regex.Replace(input, "\n+", "<br />");
        Console.WriteLine(output);
    }
}
producing the following output:
thisisatest<br />string<br />with<br />some<br />unsanatized<br />breaks
This is not the most efficient way of handling this, nor the smartest (disclaimer), but if your text is not too big it doesn't matter much, short of any smarter algorithm. (Note: it's hard to detect something like char\nchar\nchar\n..., though you could set a limit on the line length.)
You could just Split on whitespace characters (add any you can think of, except \n), then Join with just one space, and then split on \n (to get lines) and join with <br />. While joining the lines you can test for line.Length > 2, for example.
To make this faster you can iterate with a more efficient algorithm, char by char, using IndexOf etc.
Again, not the most efficient or perfect way of handling this, but it would give you something fast.
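A minimal sketch of the Split/Join idea, with two assumptions that the answer does not spell out: whitespace is collapsed per line (which avoids stray spaces around the line breaks), and lines that fail the length test are simply dropped. The class name, method name, and the minLineLength parameter are hypothetical, and the input is assumed to be HTML-encoded already.

```csharp
using System;
using System.Linq;

static class LineJoiner
{
    // Hypothetical helper; minLineLength = 3 mirrors the "line.Length > 2" test.
    public static string ToHtml(string encodedInput, int minLineLength = 3)
    {
        // Normalize all line endings to \n before splitting into lines.
        string text = (encodedInput ?? string.Empty)
            .Replace("\r\n", "\n")
            .Replace('\r', '\n');

        var lines = text.Split('\n')
            // Collapse runs of spaces and tabs inside each line to one space.
            .Select(line => string.Join(" ",
                line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)))
            // Assumption: drop lines shorter than the minimum (this also drops empty lines).
            .Where(line => line.Length >= minLineLength);

        return string.Join("<br />", lines);
    }
}
```

Dropping short lines discards legitimate one- or two-character lines too, so merging them into the neighbouring line may be the gentler choice in practice.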
EDIT: To filter 'same lines', you could use e.g. DistinctUntilChanged, which is from Ix, the Interactive Extensions (see NuGet, Ix-experimental I think). It should filter consecutive 'same lines', and you could add a line test for those.
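For readers who do not want the extra package, the same idea can be sketched in plain C#. This is a hand-rolled stand-in with hypothetical names, not the Ix implementation: it yields a line only when it differs from the line immediately before it.

```csharp
using System;
using System.Collections.Generic;

static class ConsecutiveLines
{
    // Yields each line unless it equals the line immediately before it.
    public static IEnumerable<string> DropRepeats(IEnumerable<string> lines)
    {
        string previous = null;
        bool first = true;
        foreach (string line in lines)
        {
            if (first || !string.Equals(line, previous, StringComparison.Ordinal))
            {
                yield return line;
            }
            previous = line;
            first = false;
        }
    }
}
```

For example, the sequence a, a, b, a, a, a becomes a, b, a. A flood that alternates between two different lines passes through unchanged, so this is only a partial measure.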
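A random suggestion inspired by slashdot.org's comment filter: compress the user input with System.IO.Compression.DeflateStream, and reject it if the result is too small compared to the original (you will have to experiment a bit to find a useful cutoff).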
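A minimal sketch of that compression check. The class name, the 0.2 ratio cutoff, and the 200-character minimum are all assumptions; as the suggestion says, a useful cutoff has to be found by experiment. Very short inputs are skipped because compressed output carries fixed overhead that makes their ratios meaningless.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class CompressionCheck
{
    // Ratio of deflated size to UTF-8 size; lower means more repetitive input.
    public static double Ratio(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text ?? string.Empty);
        if (raw.Length == 0) return 1.0;

        using (var buffer = new MemoryStream())
        {
            // Dispose the DeflateStream before reading the length so all bytes are flushed;
            // leaveOpen: true keeps the MemoryStream usable afterwards.
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            return (double)buffer.Length / raw.Length;
        }
    }

    // minRatio and minLength are assumed values that need tuning on real comments.
    public static bool LooksLikeFlood(string text, double minRatio = 0.2, int minLength = 200)
    {
        return text != null && text.Length >= minLength && Ratio(text) < minRatio;
    }
}
```

A thousand repetitions of "a" followed by two line breaks compress to a tiny fraction of their original size and are flagged, while short comments are never flagged.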