
How to validate that a string doesn't contain HTML using C#

Does anyone have a simple, efficient way of checking that a string doesn't contain HTML? Basically, I want to check that certain fields only contain plain text. I thought about looking for the < character, but that can easily appear in plain text. Another way might be to create a new System.Xml.Linq.XElement using:

XElement.Parse("<wrapper>" + MyString + "</wrapper>")

and check that the XElement contains no child elements, but this seems a little heavyweight for what I need.

The following will match any matching pair of tags, e.g. <b>this</b>:

Regex tagRegex = new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");

The following will match any single tag, e.g. <b> (it doesn't have to be closed):

Regex tagRegex = new Regex(@"<[^>]+>");

You can then use it like so:

bool hasTags = tagRegex.IsMatch(myString);
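A minimal sketch of how the two patterns above behave side by side. The class and method names (TagRegexDemo, HasPairedTag, HasAnyTag) are made up for illustration; the patterns themselves are copied unchanged from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; only the two patterns come from the answer above.
public static class TagRegexDemo
{
    // Matches an opening tag together with its matching closing tag, e.g. <b>this</b>.
    private static readonly Regex PairedTag =
        new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");

    // Matches any single tag-like run: '<', one or more non-'>' characters, then '>'.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    public static bool HasPairedTag(string s) => PairedTag.IsMatch(s);

    public static bool HasAnyTag(string s) => AnyTag.IsMatch(s);
}
```

Note that the single-tag pattern also reports plain text such as "a < b and c > d" as containing a tag, because everything between the two brackets satisfies [^>]+. The paired pattern does not have that problem, but it misses unclosed tags such as a lone <b>.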

You could ensure plain text by encoding the input using HttpUtility.HtmlEncode.

In fact, depending on how strict you want the check to be, you could use it to determine if the string contains HTML:

bool containsHTML = (myString != HttpUtility.HtmlEncode(myString));
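As a rough sketch of how strict this comparison is, the example below wraps the same idea in a helper. It uses System.Net.WebUtility.HtmlEncode instead of HttpUtility.HtmlEncode purely so that it compiles without a System.Web reference; that substitution, and the names EncodeCheckDemo and ChangesWhenEncoded, are assumptions for illustration rather than part of the answer.

```csharp
using System;
using System.Net;

// Hypothetical helper illustrating the "compare with the encoded form" check.
public static class EncodeCheckDemo
{
    // True when HTML-encoding changes the string, i.e. it contains at least one
    // character that the encoder escapes (such as <, > or &).
    public static bool ChangesWhenEncoded(string s) => s != WebUtility.HtmlEncode(s);
}
```

With this check, "hello world" passes as plain text and "<b>bold</b>" is flagged, but so is "Tom & Jerry", since the ampersand is escaped too. Whether that is acceptable depends on how strict the check needs to be.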

Here you go:

using System.Text.RegularExpressions;
private bool ContainsHTML(string checkString)
{
  return Regex.IsMatch(checkString, "<(.|\n)*?>");
}

That is the simplest way, since text enclosed in angle brackets is unlikely to occur naturally.

I just tried my XElement.Parse solution. I created an extension method on the string class so I can reuse the code easily:

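// Requires: using System.Linq; using System.Xml; using System.Xml.Linq;
public static bool ContainsXHTML(this string input)
{
    try
    {
        XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
        // Plain text parses to text nodes only (or to no nodes at all for an empty string).
        return x.DescendantNodes().Any(n => n.NodeType != XmlNodeType.Text);
    }
    catch (XmlException)
    {
        // Input that is not well-formed XML is treated as containing markup.
        return true;
    }
}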

One problem I found was that plain-text ampersand and less-than characters cause an XmlException, so the method reports that the field contains HTML (which is wrong). To fix this, the input string first needs to have its ampersands and less-than characters converted to their equivalent XHTML entities. I wrote another extension method to do that:

public static string ConvertXHTMLEntities(this string input)
{
    // Convert all ampersands to the ampersand entity.
    string output = input;
    output = output.Replace("&amp;", "amp_token");
    output = output.Replace("&", "&amp;");
    output = output.Replace("amp_token", "&amp;");

    // Convert less than to the less than entity (without messing up tags).
    output = output.Replace("< ", "&lt; ");
    return output;
}

Now I can take a user-submitted string and check that it doesn't contain HTML using the following code:

bool ContainsHTML = UserEnteredString.ConvertXHTMLEntities().ContainsXHTML();

I'm not sure if this is bulletproof, but I think it's good enough for my situation.

This also checks for things like < br /> (self-closing tags with optional whitespace). The list does not contain the new HTML5 tags.

internal static class HtmlExts
{
    public static bool containsHtmlTag(this string text, string tag)
    {
        var pattern = @"<\s*" + tag + @"\s*\/?>";
        return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
    }

    public static bool containsHtmlTags(this string text, string tags)
    {
        var ba = tags.Split('|').Select(x => new {tag = x, hastag = text.containsHtmlTag(x)}).Where(x => x.hastag);

        return ba.Count() > 0;
    }

    public static bool containsHtmlTags(this string text)
    {
        return
            text.containsHtmlTags(
                "a|abbr|acronym|address|area|b|base|bdo|big|blockquote|body|br|button|caption|cite|code|col|colgroup|dd|del|dfn|div|dl|DOCTYPE|dt|em|fieldset|form|h1|h2|h3|h4|h5|h6|head|html|hr|i|img|input|ins|kbd|label|legend|li|link|map|meta|noscript|object|ol|optgroup|option|p|param|pre|q|samp|script|select|small|span|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|ul|var");
    }
}
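One thing to be aware of: as written, the pattern in containsHtmlTag only allows whitespace and an optional slash after the tag name, so it will not match closing tags or tags that carry attributes, such as <a href="...">. A possible variation, sketched below under the assumption that those should be caught too, builds a single alternation and permits anything up to the closing bracket. The class name KnownTagDemo and the shortened tag list are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical variation on the tag-list approach; not the original author's code.
public static class KnownTagDemo
{
    // Shortened list for illustration; the full list above could be used instead.
    private static readonly string[] Tags = { "a", "b", "br", "div", "p", "script", "span" };

    // '<', an optional '/', one of the tag names as a whole word, then anything up to '>'.
    private static readonly Regex KnownTag = new Regex(
        @"<\s*/?\s*(?:" + string.Join("|", Tags.Select(Regex.Escape)) + @")\b[^>]*>",
        RegexOptions.IgnoreCase);

    public static bool ContainsKnownTag(string text) => KnownTag.IsMatch(text);
}
```

The \b after the alternation keeps a short name such as "a" from matching the start of an unknown longer name such as <abc>.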

Angle brackets may not be your only challenge. Other character sequences can also be used for script injection, such as the common double hyphen "--", which is also used in SQL injection. And there are others.

On an ASP.NET page, if validateRequest is set to true in machine.config, web.config or the page directive, the user will get an error page stating "A potentially dangerous Request.Form value was detected from the client" if an HTML tag or one of various other potential script-injection attacks is detected. You probably want to avoid this and provide a more elegant, less scary UI experience.

You could test for both the opening and closing brackets < and > using a regular expression, and allow the text if only one of them occurs. Allow < or >, but not < followed by some text and then >, in that order.
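One way to read that rule as a regular expression is sketched here. The pattern and the names BracketOrderDemo and IsAllowed are my own illustration, not something taken from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: rejects text only when '<' is followed by some text and then '>'.
public static class BracketOrderDemo
{
    // '<', then at least one character that is neither '<' nor '>', then '>'.
    private static readonly Regex OpenThenClose = new Regex(@"<[^<>]+>");

    public static bool IsAllowed(string text) => !OpenThenClose.IsMatch(text);
}
```

Under this sketch, "5 < 6", "6 > 5" and "x > y < z" are all allowed, while "<script>" is rejected.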

You could allow angle brackets and HtmlEncode the text to preserve them when the data is persisted.
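A small sketch of that encoding step, again assuming System.Net.WebUtility.HtmlEncode as a stand-in for HttpUtility.HtmlEncode so that no System.Web reference is needed. PersistDemo and EncodeForStorage are hypothetical names.

```csharp
using System;
using System.Net;

// Hypothetical helper: encodes user text so angle brackets survive as literal characters.
public static class PersistDemo
{
    public static string EncodeForStorage(string userText) => WebUtility.HtmlEncode(userText);
}
```

For example, "a < b" becomes "a &lt; b", so a browser later shows the bracket as text instead of interpreting it as markup.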

Beware when using the HttpUtility.HtmlEncode method mentioned above. If you are checking text that contains special characters but no HTML, it will wrongly report that the text contains HTML. Maybe that's why J c wrote "...depending on how strict you want the check to be..."
