
Regex to whitelist html tags

I'm trying to create a regex that whitelists a small set of HTML tags.

/<(\/)?(code|em|ul)(\/)?>$/

But there are a few cases where this fails, for example a tag with attributes:

<em style="padding: 10px">

So I tried /<(\/)?(code|em|ul)(.|\n)*?(\/)?>$/ but this also allows

<emadchgasgh style="padding: 10px">

Cases the regex needs to handle, with the expected result:

<em> - Success
</em> - Success
<br/> - Success
<em style="asdcasc"> - Success
<emacjhasjdhc> - Failure

Question: what else could be added to the regex?

/<\s*\/?\s*(code|em|ul|br)\b.*?>/

\s*\/?\s* allows optional whitespace and an optional closing slash before the tag name
(code|em|ul|br)\b matches only a whole tag name; the word boundary rejects <emacjhasjdhc>
.*?> lazily matches everything up to the next > character
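
As a rough sketch of how this pattern could be checked against the cases in the question, the JavaScript below wraps an anchored variant of it in two helpers. The variant adds ^ and $ so the whole string must be one tag, and uses [^>]* in place of .*? so a match cannot run past the first >. The names ALLOWED_TAG, isAllowedTag, and hasOnlyAllowedTags are hypothetical and not part of the original answer.

```javascript
// Anchored variant of the pattern above (an assumption, not the answer's exact regex).
// No "g" flag, so RegExp.prototype.test keeps no state between calls.
const ALLOWED_TAG = /^<\s*\/?\s*(code|em|ul|br)\b[^>]*>$/;

// True when `tag` is exactly one opening, closing, or self-closing whitelisted tag.
function isAllowedTag(tag) {
  return ALLOWED_TAG.test(tag);
}

// True when every <...> run found in `text` is a whitelisted tag.
// Text without any tags passes trivially.
function hasOnlyAllowedTags(text) {
  const tags = text.match(/<[^>]*>/g) || [];
  return tags.every(isAllowedTag);
}

// The cases listed in the question.
const cases = [
  '<em>',                 // expected: allowed
  '</em>',                // expected: allowed
  '<br/>',                // expected: allowed
  '<em style="asdcasc">', // expected: allowed
  '<emacjhasjdhc>',       // expected: rejected
];
for (const tag of cases) {
  console.log(tag, isAllowedTag(tag));
}

console.log(hasOnlyAllowedTags('foo <code>code here</code> bar <br>'));
console.log(hasOnlyAllowedTags('foo <div>not allowed</div>'));
```

Two limits of this sketch are worth keeping in mind: \b treats a hyphen as a boundary, so a name such as <em-x> is still accepted, and an attribute value that contains > will split the tag in the wrong place.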

On the client side, parse the text into a document with DOMParser and use querySelector to look for any element that is not code, em, ul, or br, using the selector:

*:not(code):not(em):not(ul):not(br)

If anything is returned, the string does not pass.

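const test = (str) => {
  // Parse the string as HTML. Only descendants of <body> are inspected below;
  // elements the parser places in <head> (e.g. <script>, <style>) are not checked.
  const doc = new DOMParser().parseFromString(str, 'text/html');
  // Passes only when no element outside the whitelist is found.
  return !doc.body.querySelector('*:not(code):not(em):not(ul):not(br)');
};
console.log(test('foo <br> bar')); // true
console.log(test('foo <code>code here</code> bar <br>')); // true
console.log(test('foo <div>not allowed</div>')); // false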

In Java, you can use Jsoup to parse a given HTML string and then select elements inside it, e.g.:

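// parseBodyFragment keeps all of the parsed input inside <body>.
Document doc = Jsoup.parseBodyFragment(input);
// "body " limits the match to descendants; a bare "*" would also match the html/head/body wrappers.
Elements forbiddenElements = doc.select("body *:not(code):not(em):not(ul):not(br)");

If forbiddenElements is not empty, the string contains forbidden elements.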
