
How to remove all HTML tags and display plain text using C#

I want to remove all HTML tags from a string. I can achieve this using a regex.

However, if the string contains a number inside angle brackets, such as <100>, the number should not be removed.

    // requires: using System.Text.RegularExpressions;
    var withHtml = "<p>hello <b>there<1234></b></p>";
    var withoutHtml = Regex.Replace(withHtml, "\\<[^\\>]*\\>", string.Empty);

Result: hello there

Desired output: hello there 1234

I'm not sure you can do this in a single regular expression, or that a regex is really the right tool, as others have suggested. A simple improvement that gets you almost there is:

Regex.Replace(withHtml, "\\<[^\\>0-9]*\\>", string.Empty);

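This gives "hello there<1234>". You then just need to remove the remaining angle brackets. Note that any real tag containing a digit, such as <h1>, will also survive this pattern.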
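The two steps described above could be combined roughly as follows. This is only a sketch: the class and method names (TagStripper, StripKeepingNumbers) are made up for illustration, and the second step assumes that every bracketed run of digits should be kept as a number preceded by a space, which is what the desired output in the question suggests.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class TagStripper
{
    public static string StripKeepingNumbers(string input)
    {
        // Step 1: remove every <...> group that contains no digits,
        // so a group such as <1234> is left in place.
        string noTags = Regex.Replace(input, "<[^>0-9]*>", string.Empty);

        // Step 2: replace each remaining <digits> group with a space
        // followed by the digits themselves.
        return Regex.Replace(noTags, "<([0-9]+)>", " $1");
    }
}
```

Calling TagStripper.StripKeepingNumbers("<p>hello <b>there<1234></b></p>") returns "hello there 1234". Tags that contain digits alongside other characters (for example <h1> or an attribute value with a number in it) pass through both steps untouched, so this is only suitable for input as simple as the sample in the question.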

Your example isn't valid HTML, since <1234> is not an HTML tag. I assume you intended the angle brackets to be encoded.
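To illustrate what "encoded" means here, the sketch below uses System.Net.WebUtility from the .NET base class library to turn literal angle brackets into entities and back. The wrapper class EncodingDemo and its method names are hypothetical and exist only for this example.

```csharp
using System;
using System.Net;

// Hypothetical wrapper for illustration only.
public static class EncodingDemo
{
    // Converts characters such as < and > into HTML entities.
    public static string Encode(string text) => WebUtility.HtmlEncode(text);

    // Converts HTML entities back into the characters they stand for.
    public static string Decode(string html) => WebUtility.HtmlDecode(html);
}
```

EncodingDemo.Encode("there<1234>") returns "there&lt;1234&gt;", which an HTML parser treats as text rather than as a tag, and EncodingDemo.Decode reverses it.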

I don't think regular expressions are suitable for HTML parsing. I recommend using an HTML parser such as HTML Agility Pack to do this.

Here's an example:

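// requires: using HtmlAgilityPack;
var withHtml = "<p>hello <b>there&lt;1234&gt;</b></p>";
var document = new HtmlDocument();
document.LoadHtml(withHtml);

// InnerText drops the tags; DeEntitize decodes &lt; and &gt;, giving "hello there<1234>"
var withoutHtml = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);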

Just add the HtmlAgilityPack NuGet package and a reference to System.Xml to make it work.


 