I'm trying to create a regex that whitelists a small set of HTML tags.
/<(\/)?(code|em|ul)(\/)?>$/
But there are a few cases where this fails:
<em style="padding: 10px">
So I tried /<(\/)?(code|em|ul)(.|\n)*?(\/)?>$/
but this also allows
<emadchgasgh style="padding: 10px">
Expected result for each case:
<em> - Success
</em> - Success
<br/> - Success
<em style="asdcasc"> - Success
<emacjhasjdhc> - Failure
Question: what else could be added to the regex?
/<\s*\/?\s*(code|em|ul|br)\b.*?>/
\s*\/?\s* - there may be spaces, and an optional closing slash, before the name of the tag
(code|em|ul|br)\b - matches only the whole tag name
.*?> - matches everything up to the first > character
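As a quick check, the pattern above can be run against the cases listed in the question. This is only a minimal sketch: the helper name matchesAllowedTag and the cases array are made up for illustration and are not part of the original answer.

```javascript
// Pattern copied from the answer above.
const ALLOWED_TAG = /<\s*\/?\s*(code|em|ul|br)\b.*?>/;

// Hypothetical helper: true when the string contains a whitelisted tag.
function matchesAllowedTag(str) {
  return ALLOWED_TAG.test(str);
}

// Cases taken from the question.
const cases = [
  '<em>',
  '</em>',
  '<br/>',
  '<em style="asdcasc">',
  '<emacjhasjdhc>',
  '<emadchgasgh style="padding: 10px">',
];

for (const tag of cases) {
  console.log(tag, '->', matchesAllowedTag(tag) ? 'Success' : 'Failure');
}
```

Two caveats are worth keeping in mind. The pattern is not anchored, so it only reports that a whitelisted tag appears somewhere in the string, not that every tag in the string is whitelisted. Also, \b matches before a hyphen, so a name such as <em-x> would still pass.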
On the client side, parse the text into a document with DOMParser and use querySelector to select an element which is not code, em, ul, or br, with the query string:
*:not(code):not(em):not(ul):not(br)
If anything is returned, the string does not pass.
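const test = (str) => {
  // Parse the string as HTML; the markup ends up inside doc.body.
  const doc = new DOMParser().parseFromString(str, 'text/html');
  // querySelector returns null when no forbidden element exists.
  return !doc.body.querySelector('*:not(code):not(em):not(ul):not(br)');
};
console.log(test('foo <br> bar')); // true
console.log(test('foo <code>code here</code> bar <br>')); // true
console.log(test('foo <div>not allowed</div>')); // false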
In Java, you can use Jsoup to parse a given HTML string, and then you can select elements inside it, e.g.:
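Document doc = Jsoup.parse(input);
// Jsoup.parse wraps the input in html, head and body elements, so only descendants of body are checked.
Elements forbiddenElements = doc.select("body *:not(code):not(em):not(ul):not(br)");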
If forbiddenElements has anything in it, the string contains forbidden elements.