C#: strip HTML tags and decode entities

Question

Are there any equivalents to the PHP functions strip_tags and html_entity_decode? I'm using .NET 3.5.

So if I have:

<textarea cols="5">Some &lt; text</textarea>

I'd like to get:

Some < text

Thanks for your responses.

Answer 1

You can use HtmlAgilityPack:

// Requires: using System.Linq; using System.Web;
string html = @"<textarea cols=""5"">Some &lt; text</textarea>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// InnerText drops the tags; the entities are decoded in a separate step
var text = doc.DocumentNode.Descendants("textarea").First().InnerText;
var decodedText = HttpUtility.HtmlDecode(text);

Answer 2

I wanted to share the code I wrote to do this. I love PHP, but my job is in C#, so I recreated the strip_tags functionality.

Example of how to use it:

string exampleOneWithAllStripped = StripTag("<br />this is an <b>example</b>", null);

string exampleTwoWithOnlyBoldAllowed = StripTag("<br />this is an <b>example</b>", "b");

string exampleThreeWithBRandBoldAllowed = StripTag("<br />this is an <b>example</b>", "b,<br>");

    /// <summary>
    ///     Strips HTML and other markup tags from the given string, except for those in listOfAllowedTags.
    ///     This method is the ASP.NET version of the PHP strip_tags function. It strips out all HTML and XML tags
    ///     except for the ones explicitly allowed in the second parameter. For allowed tags, this method DOES NOT
    ///     strip out attributes.
    /// </summary>
    /// <param name="htmlString">
    ///     The HTML string.
    /// </param>
    /// <param name="listOfAllowedTags">
    ///     The list of allowed tags. If null, no tags are allowed. Otherwise, e.g. "b,&lt;br/&gt;,&lt;hr&gt;,p,i,&lt;u&gt;"
    /// </param>
    /// <returns>
    ///     The cleaned string.
    /// </returns>
    /// <author>James R.</author>
    /// <createdate>10-27-2011</createdate>
    public static string StripTag(string htmlString, string listOfAllowedTags)
    {
        if (string.IsNullOrEmpty(htmlString))
        {
            return htmlString;
        }

        // this is the reg pattern that will retrieve all tags
        string patternThatGetsAllTags = "</?[^><]+>";

        // Create the Regex for all of the Allowed Tags
        string patternForTagsThatAreAllowed = string.Empty;
        if (!string.IsNullOrEmpty(listOfAllowedTags))
        {
            // reduce entries such as <p>,<i>,<b> to bare tag names: p,i,b
            Regex remove = new Regex("[<>\\/ ]+");

            // strip out <, >, / and spaces
            listOfAllowedTags = remove.Replace(listOfAllowedTags, string.Empty);

            // split at the commas
            string[] listOfAllowedTagsArray = listOfAllowedTags.Split(',');

            foreach (string allowedTag in listOfAllowedTagsArray)
            {
                if (string.IsNullOrEmpty(allowedTag))
                {
                    // jump to next element of array.
                    continue;
                }

                string patternVersion1 = "<" + allowedTag + ">"; // <p>

                // tag with attributes or self-closing: <img src=stuff>, <hr style="width:50%;" />, <br/>
                string patternVersion2 = "<" + allowedTag + "[ /][^><]*>$";
                string patternVersion3 = "</" + allowedTag + ">"; // closing tag

                // if it is not the first time, then add the pipe | to the end of the string
                if (!string.IsNullOrEmpty(patternForTagsThatAreAllowed))
                {
                    patternForTagsThatAreAllowed += "|";
                }

                patternForTagsThatAreAllowed += patternVersion1 + "|" + patternVersion2 + "|" + patternVersion3;
            }
        }

        // Get all html tags included in the string
        Regex regexHtmlTag = new Regex(patternThatGetsAllTags);

        if (!string.IsNullOrEmpty(patternForTagsThatAreAllowed))
        {
            MatchCollection allTagsThatMatched = regexHtmlTag.Matches(htmlString);

            // build the allowed-tag regex once instead of once per matched tag
            Regex regOfAllowedTag = new Regex(patternForTagsThatAreAllowed);

            foreach (Match theTag in allTagsThatMatched)
            {
                Match matchOfTag = regOfAllowedTag.Match(theTag.Value);

                if (!matchOfTag.Success)
                {
                    // if not allowed, replace it with nothing
                    htmlString = htmlString.Replace(theTag.Value, string.Empty);
                }
            }
        }
        else
        {
            // else strip out all tags
            htmlString = regexHtmlTag.Replace(htmlString, string.Empty);
        }

        return htmlString;
    }

Answer 3

Use a Regex to replace the tags (pattern <.*?>) and the HttpUtility class to decode the entities.

Answer 4

Here is the complete code:

Stripping tags.

public static string StripTags(string source)
{
  return Regex.Replace(source, "<.*?>", string.Empty);
}

Decoding entities.

public static string DecodeHtmlEntities(string text)
{
    return HttpUtility.HtmlDecode(text);
}
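
The two helpers above can be chained to produce the result asked for in the question. The following is only a minimal sketch: the class name HtmlText and the method ToPlainText are hypothetical names introduced for illustration, and the null/empty check is an added assumption. It strips the tags first and decodes afterwards, so that a decoded &lt; is not mistaken for the start of a tag.

```csharp
using System.Text.RegularExpressions;
using System.Web; // HttpUtility; on .NET Framework this needs a reference to System.Web.dll

// Hypothetical helper class; the names HtmlText and ToPlainText are not from the answers.
public static class HtmlText
{
    // Same pattern as StripTags above: removes anything between '<' and the next '>'.
    public static string StripTags(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Turns entities such as &lt; and &amp; back into their characters.
    public static string DecodeHtmlEntities(string text)
    {
        return HttpUtility.HtmlDecode(text);
    }

    // Strip first, then decode, so a decoded "&lt;" is never treated as a tag.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html; // assumption: null or empty input is returned unchanged
        }

        return DecodeHtmlEntities(StripTags(html));
    }
}
```

Calling HtmlText.ToPlainText on the textarea example from the question returns Some < text. Because the pattern is a plain regular expression, literal text that itself contains both < and > (for example "a < b and c > d") would also be partly removed; an HTML parser is the safer choice for such input.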

4 answers

solution1
5 2012-06-10 18:45:42

solution2
2 2015-03-25 16:47:39

solution3
1 ACCEPTED 2012-10-29 19:41:26

solution4
0 2012-08-17 06:39:14
