How to remove all HTML tags and display plain text using C#
I want to remove all HTML tags from a string. I can achieve this using a regex,
but if the string contains a number inside angle brackets, such as <100>, that number should not be removed.
var withHtml = "<p>hello <b>there<1234></b></p>";
var withoutHtml = Regex.Replace(withHtml, "\\<[^\\>]*\\>", string.Empty);
Result: hello there
But the needed output is: hello there 1234
I'm not sure you can do this in one regular expression, or that a regex is really the right tool, as others have suggested. A simple improvement that gets you almost there is:
Regex.Replace(withHtml, "\\<[^\\>0-9]*\\>", string.Empty);
This gives "hello there<1234>". You then just need to remove the remaining angle brackets.
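To make the two steps concrete, here is a minimal sketch that combines them. The class and method names (TagStripper, StripTagsKeepNumbers) are made up for illustration, and inserting a space before the number is my assumption, based on the needed output shown in the question. Keep in mind that the first pattern skips any tag containing a digit, so real tags such as <h1> would not be stripped by it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper, not part of any library.
    public static string StripTagsKeepNumbers(string input)
    {
        // Step 1: remove every <...> span that contains no digits.
        // Caveat: tags with digits in them (e.g. <h1>) are left in place.
        string noTags = Regex.Replace(input, "<[^>0-9]*>", string.Empty);

        // Step 2: unwrap <digits>, keeping the number and putting a space before it.
        return Regex.Replace(noTags, "<([0-9]+)>", " $1");
    }

    public static void Main()
    {
        Console.WriteLine(StripTagsKeepNumbers("<p>hello <b>there<1234></b></p>"));
        // prints: hello there 1234
    }
}
```

If you want the brackets removed without the extra space, replace with "$1" in step 2 instead.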
Your example isn't valid HTML, since it contains a non-HTML tag. I figure you intended the angle brackets to be encoded.
I don't think regular expressions are suitable for HTML parsing. I recommend using an HTML parser such as HTML Agility Pack to do this.
Here's an example:
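using HtmlAgilityPack;

var withHtml = "<p>hello <b>there<1234></b></p>";
var document = new HtmlDocument();
document.LoadHtml(withHtml);
// InnerText drops the markup; DeEntitize decodes entities such as &lt; and &gt;
var withoutHtml = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);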
Just add the HtmlAgilityPack NuGet package and a reference to System.Xml to make it work.