How can I remove all HTML tags in a string except for the only 's' tag?

Question

I'm trying to remove all HTML tags from a string except <s></s> tags. Right now I have:

contents.replace(/(<([^>]+)>)/gi, '')

This removes every HTML tag.

So I tried several other patterns:

<\/?(?!s)\w*\b[^>]*> . <(?.s|/s)?*?> .....

However, these regexes mishandle every tag containing the letter 's', for example <strong>, <span>, and so on.

I'd really appreciate it if you could help me.

Answer 1

Whether or not this is possible depends on how accurate you want to be. Regex cannot be used to 100% accurately parse HTML.

But if you just want something quick and dirty:

You can take advantage of the fact that String.prototype.replace allows you to differentiate between capture groups: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#specifying_a_function_as_the_replacement

So you can make two capture groups:

Group 1 (<s> or </s>): (<\/?s>)

Group 2 ("starts with <, ends with >, and has no > in between"): (<[^>]*>)

Then, in the replacement function passed to String.prototype.replace, return the match if group 1 matched; otherwise only group 2 matched, so return an empty string:

    function removeTags(text) {
      const regex = /(<\/?s>)|(<[^>]*>)/g; // Group 1 OR Group 2
      return text.replace(regex, (_, g1) => g1 || '');
    }

    let text = '<span>Span Text <s>S Text <strong>Strong Text</strong></s></span>';
    console.log(removeTags(text));
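To make the mechanism concrete, the following sketch records what the replacement callback receives for each match. The sample string and the `seen` array are illustrative assumptions, not part of the original answer; the point is only that a capture group that did not take part in a match arrives as undefined.

```javascript
const regex = /(<\/?s>)|(<[^>]*>)/g; // same pattern as in the answer

// Hypothetical helper state: collect the callback arguments for inspection.
const seen = [];
const result = '<b>x</b><s>y</s>'.replace(regex, (match, g1, g2) => {
  seen.push({ match, g1, g2 });
  return g1 || ''; // keep group 1 matches, drop everything else
});

console.log(seen);   // g1 is set only for <s> and </s>, g2 only for other tags
console.log(result);
```

Because exactly one of the two groups is defined per match, `g1 || ''` is enough to decide between keeping and dropping the tag.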

Note the flaw: if < and > exist as text, everything in between may be considered a tag when it is not:

    function removeTags(text) {
      const regex = /(<\/?s>)|(<[^>]*>)/g; // Group 1 OR Group 2
      return text.replace(regex, (_, g1) => g1 || '');
    }

    let text = '<p> This is how you start a tag: `<` and this is how you end a tag: `>`</p>';
    console.log("But the regex fails:");
    console.log(removeTags(text));

 XML parsers can see that the brackets do not create a tag: <p> This is how you start a tag: `<` and this is how you end a tag: `>`</p>

If you want accurate parsing, use a real parser (an XML or HTML parser) instead of a regex.
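The answer does not name a specific parser API. As one possible sketch, assuming the code runs in a browser, the standard DOMParser in text/html mode can build the tree, and a small walker can rebuild the string while keeping only s elements. The function names `escapeText`, `serializeKeepingS` and `removeTagsWithDom` are hypothetical, and attributes on s tags are deliberately dropped here.

```javascript
// Escape characters that would otherwise be read as markup.
function escapeText(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Walk a DOM-like tree: keep <s> wrappers, unwrap every other element.
function serializeKeepingS(node) {
  let out = '';
  for (const child of node.childNodes) {
    if (child.nodeType === 3) {          // 3 = text node
      out += escapeText(child.nodeValue);
    } else if (child.nodeType === 1) {   // 1 = element node
      const inner = serializeKeepingS(child);
      out += child.nodeName.toLowerCase() === 's' ? `<s>${inner}</s>` : inner;
    }
    // Comments and other node types are dropped.
  }
  return out;
}

// Browser-only wrapper: DOMParser is not available in plain Node.js.
function removeTagsWithDom(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return serializeKeepingS(doc.body);
}
```

In a browser, `removeTagsWithDom('<span>a <s>b</s></span>')` would go through the parser rather than a pattern, so stray < and > characters in text are handled by the parser's own rules. Note that this sketch also keeps the text content of elements such as script or style; filter those out if that matters.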

Answer 2

You could try: /(<([^>s]+)>)|(<\/?(\w{2,})>)/gmi

The first part, (<([^>s]+)>), matches every tag that does not contain the letter s anywhere between < and >.

The second part, (<\/?(\w{2,})>), matches every opening or closing tag without attributes whose name is two or more characters long.

Demo: https://regex101.com/r/AFlXam/1
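A short check of how this pattern behaves may help before relying on it. The sketch below applies the regex from this answer to two sample strings; the helper name `stripWithAnswer2` and both inputs are assumptions chosen for illustration.

```javascript
const regex = /(<([^>s]+)>)|(<\/?(\w{2,})>)/gmi; // pattern from this answer

// Hypothetical helper: delete whatever the pattern matches.
function stripWithAnswer2(text) {
  return text.replace(regex, '');
}

// Attribute-free tags: everything except <s> and </s> is removed.
const plain = stripWithAnswer2('<span>Span <s>S</s> <strong>Strong</strong></span>');
console.log(plain);

// A tag that contains the letter s and carries an attribute matches
// neither alternative, so the opening tag is left in place.
const withAttr = stripWithAnswer2('<span class="x">a</span>');
console.log(withAttr);
```

The first call yields `Span <s>S</s> Strong`, while the second leaves `<span class="x">a`, so this pattern seems best suited to markup without attributes.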

Answer 3

You can't reliably parse HTML with regex; see "RegEx match open tags except XHTML self-contained tags".

You can use a regex, with some limitations, to strip all HTML except for s tags. This builds upon Chris Hamilton's answer, but avoids false positives such as ( a <= 20 && a > 2 ) because it is aware of tag names and attributes:

    function removeTags(text) {
      const regex = /(<\/?s>)|<\/?[a-zA-Z][a-zA-Z0-9]*(?: .*?)?>/g;
      return text.replace(regex, (_, g1) => g1 || '');
    }

    const text = '<h1>Demo:</h1> <p>Paragraph with <s>S text</s>, <b>bold stuff.</b></p> <p style="color: gray">Condition: <tt>(a <= 20 && a > 2)</tt></p>';
    console.log(removeTags(text));

Output:

Demo: Paragraph with <s>S text</s>, bold stuff. Condition: (a <= 20 && a > 2)

Explanation of regex:

(<\/?s>) -- literal <s> or </s>
| -- logical or
<\/?[a-zA-Z][a-zA-Z0-9]* -- start of an opening or closing tag, such as <h1, <p, or </p
(?: .*?)? -- optional non-capture group: a space followed by a non-greedy scan over the attributes
> -- literal >

Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
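Two limitations of the pattern in this answer are worth noting: an s tag that carries attributes falls through to the second alternative and is stripped, and a self-closing tag written without a space, such as <br/>, is not matched at all. The following is a possible variant under those observations, not the answer's own regex; the name `removeTagsKeepS` and the sample input are hypothetical.

```javascript
// Variant sketch: group 1 also accepts attributes on <s>, and the generic
// alternative accepts hyphenated names and an optional "/" before ">".
function removeTagsKeepS(text) {
  const regex = /(<\/?s(?:\s[^>]*)?>)|<\/?[a-zA-Z][a-zA-Z0-9-]*(?:\s[^>]*)?\/?>/g;
  return text.replace(regex, (_, g1) => g1 || '');
}

const sample =
  '<p>Keep <s class="old">this</s> and <br/>drop <span>that</span>, ' +
  'even if a < 3 && b > 1</p>';
const cleaned = removeTagsKeepS(sample);
console.log(cleaned);
```

For this input the sketch keeps `<s class="old">this</s>` intact, removes the p, br and span tags, and leaves the comparison `a < 3 && b > 1` untouched. It is still a heuristic: text such as `a <b && c> d` would be mistaken for a tag.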

3 answers

solution1
1 ACCEPTED 2023-01-27 06:27:38

solution2
0 2023-01-27 09:48:42

solution3
0 2023-01-27 20:57:31
