
How to remove all HTML tags and display plain text using C#

I want to remove all HTML tags from a string. I can achieve this using a regex.

However, if the string contains a number inside angle brackets, such as <100>, that number should not be removed.

var withHtml = "<p>hello <b>there<1234></b></p>";
var withoutHtml = Regex.Replace(withHtml, "\\<[^\\>]*\\>", string.Empty);

Result: hello there

Needed output: hello there 1234

I'm not sure you can do this in one regular expression, or that a regex is really the right tool, as others have suggested. A simple improvement that gets you almost there is:

Regex.Replace(withHtml, "\\<[^\\>0-9]*\\>", string.Empty);

This gives "hello there<1234>". You then just need to replace the remaining angle brackets.
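The two steps above can be sketched as a small helper. This is only one possible way to do it, not the answerer's exact code: the class name `HtmlStripper` and the method `StripTagsKeepNumbers` are hypothetical, and the first pattern uses a negative lookahead instead of excluding digits from the character class, so that tags containing digits (such as <h1>) are still removed. The second step replaces each `<digits>` with a space followed by the digits, which is an assumption made to match the "hello there 1234" output asked for in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Matches anything tag-like, except a pair of angle brackets holding only digits.
    private static readonly Regex TagPattern = new Regex(@"<(?!\d+>)[^>]*>");

    // Matches the numeric placeholders that the first pattern left in place.
    private static readonly Regex NumberPattern = new Regex(@"<(\d+)>");

    // Hypothetical helper: strips tags, then unwraps <1234> into " 1234".
    public static string StripTagsKeepNumbers(string input)
    {
        string withoutTags = TagPattern.Replace(input, string.Empty);
        return NumberPattern.Replace(withoutTags, " $1");
    }
}
```

Calling `HtmlStripper.StripTagsKeepNumbers("<p>hello <b>there<1234></b></p>")` returns "hello there 1234". Like any regex-based approach, this treats the input as plain text and will not cope with comments, scripts, or a `>` inside an attribute value.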

Your example isn't valid HTML, since it contains a non-HTML tag. I assume you intended the angle brackets around the number to be encoded.
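As a rough illustration of what "encoded" means here, the sketch below builds the markup with the number's angle brackets escaped through `System.Net.WebUtility.HtmlEncode`. The class `EncodingDemo` and the method `WrapInMarkup` are hypothetical names used only for this example, and the surrounding markup is copied from the question.

```csharp
using System;
using System.Net;

public static class EncodingDemo
{
    // Hypothetical helper: encodes the user-supplied text before it is
    // placed inside markup, so "<1234>" is treated as text rather than a tag.
    public static string WrapInMarkup(string userText)
    {
        string encoded = WebUtility.HtmlEncode(userText);
        return "<p>hello <b>there" + encoded + "</b></p>";
    }
}
```

`EncodingDemo.WrapInMarkup("<1234>")` returns `<p>hello <b>there&lt;1234&gt;</b></p>`, which an HTML parser can read without mistaking the number for a tag.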

I don't think regular expressions are suitable for HTML parsing. I recommend using an HTML parser such as HTML Agility Pack instead.

Here's an example:

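using HtmlAgilityPack;

// The angle brackets around 1234 are entity-encoded, so the input is valid HTML.
var withHtml = "<p>hello <b>there&lt;1234&gt;</b></p>";
var document = new HtmlDocument();
document.LoadHtml(withHtml);

// InnerText drops the tags; DeEntitize turns &lt;1234&gt; back into <1234>.
var withoutHtml = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);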

Just add the HtmlAgilityPack NuGet package and a reference to System.Xml to make it work.

