How can I strip HTML tags from a string in ASP.NET?

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.

Example:

<ul><li>Hello</li></ul>

Output:

"Hello"

I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.

If all you need is to strip all HTML tags from a string, this works reliably with regex as well. Replace:

<[^>]*(>|$)

with the empty string, globally. Don't forget to normalize the string afterwards, replacing:

[\s\r\n]+

with a single space, and trimming the result. Optionally, convert any HTML character entities back to the actual characters.

Note:

  1. There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
  2. The solution is technically safe, as in: the result will never contain anything that could be used for cross-site scripting or to break a page layout. It is just not very clean.
  3. As with all things HTML and regex:
    Use a proper parser if you must get it right under all circumstances.
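
The three steps above (strip tags, collapse whitespace, optionally decode entities) could be put together roughly as follows. This is a minimal sketch, not code from the answer: the class name RegexTagStripper and method name Strip are made up for illustration, and entity decoding uses System.Net.WebUtility.HtmlDecode as one possible choice. Note that `\s` already covers `\r` and `\n`, so `\s+` is equivalent to the `[\s\r\n]+` pattern.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the regex approach described above.
public static class RegexTagStripper
{
    // Matches "<", any run of non-">" characters, then ">" or end of input.
    private static readonly Regex Tag = new Regex(@"<[^>]*(>|$)");

    // Matches any run of whitespace, including line breaks.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // 1. Remove everything that looks like a tag.
        string noTags = Tag.Replace(html, string.Empty);

        // 2. Collapse whitespace runs to a single space and trim.
        string normalized = Whitespace.Replace(noTags, " ").Trim();

        // 3. Optional: turn entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(normalized);
    }
}
```

For example, `RegexTagStripper.Strip("<ul><li>Hello</li></ul>")` returns `Hello`. Because step 3 decodes entities, the result can contain a literal `<` again, so it should be HTML-encoded if it is ever written back into a page.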

Go download HtmlAgilityPack, now! ;)

This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner text of all nodes. Seriously, it will take you about 10 lines of code at most. It is one of the greatest free .NET libraries out there.

Here is a sample:

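// resultsStream is the stream that holds the HTML (defined elsewhere).
string htmlContents = new System.IO.StreamReader(resultsStream, Encoding.UTF8, true).ReadToEnd();

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);

// InnerText of each top-level node already includes the text of its descendants.
var sb = new StringBuilder();
foreach (var node in doc.DocumentNode.ChildNodes)
{
    sb.Append(node.InnerText);
}
string output = sb.ToString();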

Regex.Replace(htmlText, "<.*?>", string.Empty);

protected string StripHtml(string Txt)
{
    return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}

Protected Function StripHtml(Txt as String) as String
    Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function

I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.

In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:


System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;

I have written a pretty fast method in C# which beats the hell out of the Regex. It is hosted in an article on CodeProject.

Its advantages include, besides better performance, the ability to replace named and numbered HTML entities (such as &amp; and &#203;), comment block replacement, and more.

Please read the related article on CodeProject.

Thank you.
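
To give a rough idea of what a non-regex approach can look like, here is a minimal single-pass scanner. It is not the CodeProject code, only a sketch under stated assumptions: the names ScanningTagStripper and Strip are hypothetical, entities are decoded with System.Net.WebUtility.HtmlDecode, and comments, `<script>`/`<style>` contents, and a stray `<` in ordinary text are deliberately not handled.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical sketch of a character-scanning tag stripper (not the CodeProject method).
public static class ScanningTagStripper
{
    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        var sb = new StringBuilder(html.Length);
        bool inTag = false;   // true while between '<' and the closing '>'
        char quote = '\0';    // active quote character inside a tag, or '\0'

        foreach (char c in html)
        {
            if (inTag)
            {
                if (quote != '\0')
                {
                    // Inside a quoted attribute value: only the matching quote ends it.
                    if (c == quote) quote = '\0';
                }
                else if (c == '"' || c == '\'')
                {
                    quote = c;
                }
                else if (c == '>')
                {
                    inTag = false;
                }
            }
            else if (c == '<')
            {
                inTag = true;
            }
            else
            {
                sb.Append(c);
            }
        }

        // Decode named and numbered entities in the remaining text.
        return WebUtility.HtmlDecode(sb.ToString());
    }
}
```

Because quotes are tracked inside tags, `ScanningTagStripper.Strip("<a title=\"a > b\">link</a>")` returns `link`, a case where the simple regex patterns leave attribute debris behind.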

For those of you who can't use the HtmlAgilityPack, .NET's XML reader is an option. This can fail even on well-formed HTML, though, so always add a catch with regex as a backup. Note this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.

using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public static string RemoveHTMLTags(string content)
{
    try
    {
        var textOnly = new StringBuilder();
        // Wrap the fragment in a single root element so it can be parsed as XML.
        using (var reader = XmlReader.Create(new StringReader("<xml>" + content + "</xml>")))
        {
            while (reader.Read())
            {
                // Collect text content only; element nodes are skipped.
                if (reader.NodeType == XmlNodeType.Text)
                    textOnly.Append(reader.ReadContentAsString());
            }
        }
        return textOnly.ToString();
    }
    catch (XmlException)
    {
        // The input is not well-formed XML (a tag is probably not closed).
        // Fall back to regex-based cleaning.
        string stripped = Regex.Replace(content, @"<[^>]*(>|$)", string.Empty);
        return Regex.Replace(stripped, @"[\s\r\n]+", " ");
    }
}

string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);

For those who are complaining about Michael Tiptop's solution not working, here is the .NET 4+ way of doing it:

public static string StripTags(this string markup)
{
    try
    {
        StringReader sr = new StringReader(markup);
        XPathDocument doc;
        using (XmlReader xr = XmlReader.Create(sr,
                           new XmlReaderSettings()
                           {
                               ConformanceLevel = ConformanceLevel.Fragment
                               // for multiple roots
                           }))
        {
            doc = new XPathDocument(xr);
        }

        return doc.CreateNavigator().Value; // .Value is similar to .InnerText of  
                                           //  XmlDocument or JavaScript's innerText
    }
    catch
    {
        return string.Empty;
    }
}
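
using System.Text.RegularExpressions;
using System.Web; // HttpUtility

// Entities are decoded before stripping, so entity-encoded tags such as &lt;b&gt; are removed as well.
string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);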

You can also do this with AngleSharp, which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP for getting the text out of an HTML source.

var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();

You can take a look at the key features section, where they make a case for being "better" than HAP. I think for the most part it is probably overkill for the current question, but still, it is an interesting alternative.

I've looked at the regex-based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break them, let alone malformed HTML from the wild. And what about entities like &amp;? If you want to convert HTML into plain text, you need to decode entities too.
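
To make both failure modes concrete, here is a small sketch using the `<.*?>` pattern that appears in several answers. The class name RegexLimitDemo and method name NaiveStrip are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo of where a simple tag-stripping regex goes wrong.
public static class RegexLimitDemo
{
    // Removes the shortest "<...>" runs, as in the regex-based answers.
    public static string NaiveStrip(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty);
    }
}
```

With this helper, `NaiveStrip("<a title=\"a > b\">link</a>")` returns ` b">link`, because the first `>` inside the attribute value is taken as the end of the tag, and `NaiveStrip("<p>Tom &amp; Jerry</p>")` returns `Tom &amp; Jerry`, with the entity left undecoded.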

So I propose the method below.

Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an HTML fragment. It also decodes HTML entities like &amp;. It returns just the inner text items, with a new line between each text item.

public static string RemoveHtmlTags(this string html)
{
        if (String.IsNullOrEmpty(html))
            return html;

        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
        {
            return WebUtility.HtmlDecode(html);
        }

        var sb = new StringBuilder();

        var i = 0;

        foreach (var node in doc.DocumentNode.ChildNodes)
        {
            var text = node.InnerText.SafeTrim();

            if (!String.IsNullOrEmpty(text))
            {
                sb.Append(text);

                if (i < doc.DocumentNode.ChildNodes.Count - 1)
                {
                    sb.Append(Environment.NewLine);
                }
            }

            i++;
        }

        var result = sb.ToString();

        return WebUtility.HtmlDecode(result);
}

public static string SafeTrim(this string str)
{
    if (str == null)
        return null;

    return str.Trim();
}

If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.

If you are rendering the content back to an HTML page, make sure you understand XSS vulnerability and how to prevent it, i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes &gt; etc).
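
As a minimal sketch of that encoding step, the framework's System.Net.WebUtility.HtmlEncode can be applied right before the text is written into markup. The wrapper class SafeOutput and its method ForHtml are hypothetical names used only for this example.

```csharp
using System;
using System.Net;

// Hypothetical wrapper: encode plain text before placing it in an HTML page.
public static class SafeOutput
{
    public static string ForHtml(string userText)
    {
        // Treat null as empty, then encode characters such as <, > and &.
        return WebUtility.HtmlEncode(userText ?? string.Empty);
    }
}
```

For example, `SafeOutput.ForHtml("a > b")` returns `a &gt; b`, and `SafeOutput.ForHtml("<b>hi</b>")` returns `&lt;b&gt;hi&lt;/b&gt;`, so the browser shows the text instead of interpreting it as markup.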

For the second parameter, i.e. keeping some tags, you may need some code like this, using HtmlAgilityPack:

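// Requires: using System.Collections.Generic; using System.Text; using HtmlAgilityPack;
public string StripTags(HtmlNode documentNode, IList<string> keepTags)
{
    var result = new StringBuilder();
    foreach (var childNode in documentNode.ChildNodes)
    {
        if (childNode.Name.ToLower() == "#text")
        {
            // Plain text is always kept.
            result.Append(childNode.InnerText);
        }
        else if (!keepTags.Contains(childNode.Name.ToLower()))
        {
            // Drop the tag itself but keep whatever survives inside it.
            result.Append(StripTags(childNode, keepTags));
        }
        else
        {
            // Keep the tag and replace its content with the stripped content.
            string inner = StripTags(childNode, keepTags);
            // String.Replace throws on an empty search string, so kept tags
            // without content (for example <br>) are appended unchanged.
            result.Append(childNode.InnerHtml.Length == 0
                ? childNode.OuterHtml
                : childNode.OuterHtml.Replace(childNode.InnerHtml, inner));
        }
    }
    return result.ToString();
}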

More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/

Just use string.StripHTML();
