How to remove all HTML tags and display plain text using C#

Question

I want to remove all HTML tags from a string. I can achieve this using a regex.

However, if the string contains a number inside angle brackets, such as <100>, that number should not be removed.

    var withHtml = "<p>hello <b>there<1234></b></p>";
    var withoutHtml = Regex.Replace(withHtml, "\\<[^\\>]*\\>", string.Empty);

Result: hello there

Needed output: hello there 1234

Answer 1

I'm not sure you can do this with a single regular expression, or, as others have suggested, that a regex is the right tool at all. A simple improvement that gets you almost there is:

Regex.Replace(withHtml, "\\<[^\\>0-9]*\\>", string.Empty);

This gives "hello there<1234>". You then just need to remove the remaining angle brackets.
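The two steps above can be sketched as a small helper. This is only an illustration: the class and method names (TagStripper, StripKeepingDigits, StripKeepingNumericTokens) are hypothetical and not part of the original answer. The second method is a variant I am assuming may be closer to what was asked: it uses a negative lookahead so that only purely numeric tokens such as <1234> survive, and it inserts a space before the number to match the requested output.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; all names here are illustrative assumptions.
public static class TagStripper
{
    // The answer's two-step idea: drop every <...> that contains no digit,
    // then delete whatever angle brackets are left.
    public static string StripKeepingDigits(string input)
    {
        string noTags = Regex.Replace(input, "<[^>0-9]*>", string.Empty);
        return noTags.Replace("<", string.Empty).Replace(">", string.Empty);
    }

    // Assumed variant: drop every <...> except tokens made only of digits,
    // then replace each <digits> token with a space followed by the digits.
    public static string StripKeepingNumericTokens(string input)
    {
        string noTags = Regex.Replace(input, @"<(?!\d+>)[^>]*>", string.Empty);
        return Regex.Replace(noTags, @"<(\d+)>", " $1");
    }
}
```

For the string from the question, StripKeepingDigits returns "hello there1234" and StripKeepingNumericTokens returns "hello there 1234". Note that the first method leaves the contents of any tag containing a digit in the text (for example <h1> becomes h1), which is one reason the variant may be preferable.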

Answer 2

Your example isn't valid HTML, since <1234> is not an HTML tag. I assume you intended the angle brackets to be encoded.
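As a rough sketch of what "encoded" means here, the standard WebUtility.HtmlEncode method turns the literal angle brackets into entities before the text is placed inside markup. The class and method names below (EntityDemo, EncodeText, BuildMarkup) are hypothetical and only serve the illustration.

```csharp
using System.Net;

// Hypothetical demo class; the names are illustrative assumptions.
public static class EntityDemo
{
    // Encodes literal text so that < and > become &lt; and &gt;.
    public static string EncodeText(string text) => WebUtility.HtmlEncode(text);

    // Rebuilds the markup from the question with the numeric token encoded.
    public static string BuildMarkup() =>
        "<p>hello <b>there" + EncodeText("<1234>") + "</b></p>";
}
```

Here EncodeText("<1234>") returns "&lt;1234&gt;", so BuildMarkup() produces markup in which the number is ordinary text rather than something that looks like a tag.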

I don't think regular expressions are suitable for HTML parsing. I recommend using an HTML parser such as HTML Agility Pack to do this.

Here's an example:

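using HtmlAgilityPack;

// The angle brackets around 1234 are entity-encoded, so the parser treats them as text.
var withHtml = "<p>hello <b>there&lt;1234&gt;</b></p>";
var document = new HtmlDocument();
document.LoadHtml(withHtml);

// InnerText drops the markup; DeEntitize turns &lt; and &gt; back into < and >.
var withoutHtml = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);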

Just add the HtmlAgilityPack NuGet package and a reference to System.Xml to make it work.


Answer 1  0  2013-08-29 10:04:29

Answer 2  0  2013-08-29 10:35:10
