
ASP.NET: parsing HTML to make it safe. Is this OK?

I'm sure this has been asked a number of times, but I'm having trouble finding something that matches what I want. I want to be able to safely render HTML in my webpage but only allow links, <br/> and <p> tags.

I've come up with the following, but I want to make sure I've not missed anything. If there is a better way, please let me know.

Code:

    private string RemoveEvilTags(string value)
    {
        string[] allowed = { "<br/>", "<p>", "</p>", "</a>", "<a href" };
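        // Note: the last alternative ends in two invisible characters (U+200C, U+200B); they are written as escapes here so they stay visible.
        string anchorPattern = @"<a[\s]+[^>]*?href[\s]?=[\s\""\']+(?<href>.*?)[\""\']+.*?>(?<fileName>[^<]+|.*?\u200C\u200B)?<\/a>";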
        string safeText = value;

        System.Text.RegularExpressions.MatchCollection matches = Regex.Matches(value, anchorPattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.Compiled);
        if (matches.Count > 0)
        {
            foreach (Match m in matches)
            {
                string url = m.Groups["href"].Value;
                string linkText = m.Groups["fileName"].Value;                    

                Uri testUri = null;
                if (Uri.TryCreate(url, UriKind.Absolute, out testUri) && testUri.AbsoluteUri.StartsWith("http"))
                {
                    safeText = safeText.Replace(m.Groups[0].Value, string.Format("<a href=\"{0}\" >{1}</a>", testUri.AbsoluteUri, linkText));
                }
                else
                {
                    safeText = safeText.Replace(m.Groups[0].Value, linkText);
                }
            }
        }

        // Strip every remaining tag except the allowed ones and anything that starts with "<a href".
        safeText = System.Text.RegularExpressions.Regex.Replace(safeText, @"<[a-zA-Z\/][^>]*>", m => m != null && allowed.Contains(m.Value) || m.Value.StartsWith("<a href") ? m.Value : String.Empty);

        return safeText;
    }

Tests:

    [Test]
    public void EvilTagTest()
    {
        var safeText = RemoveEvilTags("this is a test <p>ok</p>");
        Assert.AreEqual("this is a test <p>ok</p>", safeText);

        safeText = RemoveEvilTags("this is a test <script>ok</script>");
        Assert.AreEqual("this is a test ok", safeText);

        safeText = RemoveEvilTags("this is a test <script><script>ok</script></script>");
        Assert.AreEqual("this is a test ok", safeText);

        //Check relative link
        safeText = RemoveEvilTags("this is a test <a href=\"bob\" >click here</a>");
        Assert.AreEqual("this is a test click here", safeText);

        //Check full link
        safeText = RemoveEvilTags("this is a test <a href=\"http://test.com/\" >click here</a>");
        Assert.AreEqual("this is a test <a href=\"http://test.com/\" >click here</a>", safeText);

        //Check full https link
        safeText = RemoveEvilTags("this is a test <a href=\"https://test.com/\" >click here</a>");
        Assert.AreEqual("this is a test <a href=\"https://test.com/\" >click here</a>", safeText);

        //javascript link
        safeText = RemoveEvilTags("this is a test <a href=\"javascript:evil()\" >click here</a>");
        Assert.AreEqual("this is a test click here", safeText);

        safeText = RemoveEvilTags("this is a test <a href=\"https://test.com/\" ><script>evil();</script>click here</a>");
        Assert.AreEqual("this is a test <a href=\"https://test.com/\" >click here</a>", safeText);
    }

All tests pass, but what have I missed?

Thank you.

As a matter of best practice, you should not write your own "RemoveEvilTags" library. There are plenty of techniques a malicious user could use to perform an XSS attack, and a hand-rolled filter is unlikely to cover them all. Microsoft already provides an Anti-XSS Library for ASP.NET:

http://msdn.microsoft.com/en-us/library/aa973813.aspx

Since you're using ASP.NET, Pluralsight has a good video on XSS. It is focused more on MVC, but it is still valid in this context.

http://www.pluralsight-training.net/microsoft/players/PSODPlayer?author=scott-allen&name=mvc3-building-security&mode=live&clip=0&course=aspdotnet-mvc3-intro
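
To make the point about attack techniques concrete, here is a small console sketch that feeds three inputs to a copy of the question's function. The class name BypassDemo, the Main method and the three inputs are illustrative assumptions, not part of the original post. The copy is lightly condensed, and the two invisible characters in the original pattern are written as \u200C\u200B escapes. Each input is expected to come back unchanged, because none of them is matched by the anchor pattern and the final pass keeps any tag that starts with "<a href".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BypassDemo
{
    // Condensed copy of the question's RemoveEvilTags, made static so it runs in a console app.
    public static string RemoveEvilTags(string value)
    {
        string[] allowed = { "<br/>", "<p>", "</p>", "</a>", "<a href" };
        string anchorPattern = @"<a[\s]+[^>]*?href[\s]?=[\s\""\']+(?<href>.*?)[\""\']+.*?>(?<fileName>[^<]+|.*?\u200C\u200B)?<\/a>";
        string safeText = value;

        MatchCollection matches = Regex.Matches(value, anchorPattern,
            RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.Compiled);
        foreach (Match m in matches)
        {
            string url = m.Groups["href"].Value;
            string linkText = m.Groups["fileName"].Value;
            Uri testUri;
            if (Uri.TryCreate(url, UriKind.Absolute, out testUri) && testUri.AbsoluteUri.StartsWith("http"))
                safeText = safeText.Replace(m.Groups[0].Value, string.Format("<a href=\"{0}\" >{1}</a>", testUri.AbsoluteUri, linkText));
            else
                safeText = safeText.Replace(m.Groups[0].Value, linkText);
        }

        // Same final pass as the question: keep allowed tags and anything starting with "<a href".
        return Regex.Replace(safeText, @"<[a-zA-Z\/][^>]*>",
            m => allowed.Contains(m.Value) || m.Value.StartsWith("<a href") ? m.Value : string.Empty);
    }

    public static void Main()
    {
        string[] inputs =
        {
            "<a href=javascript:alert(1)>click</a>",  // unquoted href: the anchor pattern requires a quote or space after '='
            "<a href=\"javascript:alert(1)\">click",  // no closing </a>: the anchor pattern never matches
            "<img src=x onerror=alert(1) ",           // no closing '>': the tag pattern never matches
        };

        foreach (string input in inputs)
        {
            string output = RemoveEvilTags(input);
            Console.WriteLine("{0}  ->  {1}  (unchanged: {2})", input, output, output == input);
        }
    }
}
```

The first two inputs leave a javascript: link in the output. Whether the third is exploitable depends on the markup that follows it on the page, since a later '>' can close the tag.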

Instead of writing such code, I would suggest using an HTML parser such as Html Agility Pack.

Your parsing code may run into a lot of unhandled corner cases; a parser should handle most of them. Once the input is parsed, you can reject invalid input or allow only valid tags (as per your needs).
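
As a rough illustration of the "parse, then allow only valid tags" idea, the sketch below walks the tree that Html Agility Pack produces and rebuilds the output from scratch. It assumes the HtmlAgilityPack NuGet package is referenced. The class name WhitelistSanitizer, the choice to drop script and style content entirely, and the choice to unwrap other disallowed tags while keeping their text are assumptions made for this example, not something the answer specifies. Because the output is assembled only from fixed tag strings and HTML-encoded text, markup that the parser interprets in an unexpected way should still come out as inert text.

```csharp
using System;
using System.Net;
using System.Text;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class WhitelistSanitizer
{
    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        var output = new StringBuilder();
        foreach (HtmlNode child in doc.DocumentNode.ChildNodes)
            Append(child, output);
        return output.ToString();
    }

    private static void Append(HtmlNode node, StringBuilder output)
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            // Decode entities, then re-encode, so no raw '<' or '>' from the input reaches the page.
            output.Append(WebUtility.HtmlEncode(HtmlEntity.DeEntitize(node.InnerText)));
            return;
        }
        if (node.NodeType != HtmlNodeType.Element)
            return; // comments and other node types are dropped

        string name = node.Name.ToLowerInvariant();
        if (name == "script" || name == "style")
            return; // dropped together with their content (an assumption of this sketch)
        if (name == "br")
        {
            output.Append("<br/>");
            return;
        }

        // Disallowed tags keep empty open/close strings, so only their children are emitted.
        string open = string.Empty, close = string.Empty;
        if (name == "p")
        {
            open = "<p>";
            close = "</p>";
        }
        else if (name == "a")
        {
            string href = HtmlEntity.DeEntitize(node.GetAttributeValue("href", string.Empty));
            Uri uri;
            // Only absolute http/https links survive; every other attribute is discarded.
            if (Uri.TryCreate(href, UriKind.Absolute, out uri) &&
                (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                open = "<a href=\"" + WebUtility.HtmlEncode(uri.AbsoluteUri) + "\">";
                close = "</a>";
            }
        }

        output.Append(open);
        foreach (HtmlNode child in node.ChildNodes)
            Append(child, output);
        output.Append(close);
    }
}
```

A call such as WhitelistSanitizer.Sanitize(userInput) would replace RemoveEvilTags(userInput). The exact output for malformed input depends on how the parser repairs it, so the question's unit tests are worth re-running against this version rather than assuming identical results.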

