
Regex to remove HTML attributes and tags except allowed

I need to sanitize input text that contains HTML: every tag and attribute that is not on an allow-list should be removed.

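        // Requires: using System.Text.RegularExpressions;
        // The method name and signature are illustrative; msg is the raw input text.
        static string RemoveDisallowedHtml(string msg)
        {
            string acceptableTags = "h1|h2|h3|h4|h5|h6|br|img|video|cut|a";
            string acceptableAttributes = "alt|href|height|width|align|valign|src|class|id|name|title";
            // Pass 1: remove tags whose name does not match acceptableTags.
            string stringPattern = @"</?(?(?=" + acceptableTags + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>";
            string result = Regex.Replace(msg, stringPattern, "");
            // Pass 2: remove attributes whose name does not match acceptableAttributes.
            stringPattern = @"\s(?!(" + acceptableAttributes + @"))\w+(\s*=\s*[""|']?[/.,#?\w\s:;-]+[""|']?)";
            result = Regex.Replace(result, stringPattern, "");
            return result;
        }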

This code almost works. For example, it removes the onload attribute here:

<img src="pic.jpg" onload=" alert(123)">

but not here:

<img src="pic.jpg"onload="alert(123)">
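
The difference between the two cases appears to come from the leading `\s` in the attribute pattern: an attribute is only a candidate for removal when a whitespace character sits directly in front of its name. A minimal check, reusing the attribute pattern from the code above unchanged (the class and method names are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributePatternCheck
{
    // Attribute allow-list and pattern copied from the question.
    const string Allowed = "alt|href|height|width|align|valign|src|class|id|name|title";

    static readonly Regex Attribute = new Regex(
        @"\s(?!(" + Allowed + @"))\w+(\s*=\s*[""|']?[/.,#?\w\s:;-]+[""|']?)");

    // True when the pattern finds a disallowed attribute it would strip.
    // Hypothetical helper, not part of the original code.
    public static bool FindsDisallowed(string html) => Attribute.IsMatch(html);
}
```

`FindsDisallowed` returns `true` for the first `<img>` example and `false` for the second. In the second input the only whitespace is the one before `src`, and `src` is on the allow-list, so the pattern never gets a starting point in front of `onload`.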

P.S. A single regex for this would be better, but I do not know regular expressions well enough to write one.
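
One possible way to do this in a single pass is to match each tag once and rebuild it inside a `MatchEvaluator`, keeping only the allowed attributes. The sketch below is an assumption-laden illustration rather than a vetted sanitizer: the class name, method name and both patterns are invented here. It leaves malformed tags (for example, one with an unbalanced quote) untouched, it does not inspect attribute values such as `javascript:` URLs, and it drops the trailing `/` of self-closing tags.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlAllowList
{
    // Allow-lists copied from the question; lookups are exact and case-insensitive.
    static readonly HashSet<string> Tags = new HashSet<string>(
        "h1|h2|h3|h4|h5|h6|br|img|video|cut|a".Split('|'),
        StringComparer.OrdinalIgnoreCase);

    static readonly HashSet<string> Attributes = new HashSet<string>(
        "alt|href|height|width|align|valign|src|class|id|name|title".Split('|'),
        StringComparer.OrdinalIgnoreCase);

    // One whole tag: optional '/', the tag name, then everything up to '>'.
    // Quoted values are consumed as units so a '>' inside quotes does not end the tag.
    static readonly Regex Tag = new Regex(
        @"<(?<close>/?)(?<name>[a-zA-Z][a-zA-Z0-9]*)(?<rest>(?:""[^""]*""|'[^']*'|[^'"">])*)>");

    // One attribute: a name plus an optional quoted or unquoted value.
    // No leading whitespace is required, so src="x"onload="y" yields two attributes.
    static readonly Regex Attr = new Regex(
        @"(?<name>[a-zA-Z_:][-\w:.]*)(?:\s*=\s*(?<value>""[^""]*""|'[^']*'|[^\s""'>]+))?");

    public static string Clean(string html)
    {
        return Tag.Replace(html, m =>
        {
            string name = m.Groups["name"].Value;
            if (!Tags.Contains(name)) return "";                    // drop the disallowed tag
            if (m.Groups["close"].Value == "/") return "</" + name + ">";

            // Rebuild the opening tag from the allowed attributes only.
            var sb = new StringBuilder("<" + name);
            foreach (Match a in Attr.Matches(m.Groups["rest"].Value))
            {
                if (Attributes.Contains(a.Groups["name"].Value))
                    sb.Append(' ').Append(a.Value);
            }
            return sb.Append('>').ToString();
        });
    }
}
```

With this sketch, `HtmlAllowList.Clean` turns both `<img>` examples above into `<img src="pic.jpg">`, because attributes are split on their own syntax rather than on the whitespace in front of them. For untrusted input, a real HTML parser with an allow-list is generally a safer base than regular expressions.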


 