
How can I remove all HTML tags in a string except for the 's' tag?

I'm trying to remove all HTML tags except <s></s> tags. Right now I have:

contents.replace(/(<([^>]+)>)/gi, '')

This removes all HTML tags.

I also tried several other patterns:

<\/?(?!s)\w*\b[^>]*> . <(?.s|/s)?*?> .....

However, these regexes remove all tags containing the letter 's', for example <strong>, <span>, and so on.

I'd really appreciate it if you could help me.

Whether or not this is possible depends on how accurate you want to be. Regex cannot be used to 100% accurately parse HTML.

But if you just want something quick and dirty:

You can take advantage of the fact that String.prototype.replace allows you to differentiate between capture groups: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#specifying_a_function_as_the_replacement
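As a minimal sketch of that mechanism (the sample string, the pattern, and the names `seen`, `letter`, and `digits` are made up for illustration), the replacer function receives the full match first and then one argument per capture group. A group that did not take part in the match arrives as `undefined`:

```javascript
// Record what the replacer callback receives for each match.
const seen = [];

const result = 'a1b22'.replace(/([a-z])|(\d+)/g, (match, letter, digits) => {
  seen.push({ match, letter, digits });
  // Only one of the two groups is defined for any given match.
  return letter !== undefined ? letter.toUpperCase() : `[${digits}]`;
});

console.log(result); // A[1]B[22]
console.log(seen[0]); // { match: 'a', letter: 'a', digits: undefined }
```

Because the unmatched group is `undefined`, a callback can decide what to return just by checking which group is set.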

So you can make two capture groups:

Group 1 (<s> or </s>): (<\/?s>)

Group 2 (starts with <, ends with >, and has no > in between): (<[^>]*>)

Then, in the replacer function passed to String.prototype.replace, return the match if group 1 captured it; otherwise only group 2 matched, so return an empty string:

    function removeTags(text) {
      const regex = /(<\/?s>)|(<[^>]*>)/g; // Group 1 OR Group 2
      // Keep the match when group 1 captured it; otherwise drop it.
      return text.replace(regex, (_, g1) => g1 || '');
    }

    let text = '<span>Span Text <s>S Text <strong>Strong Text</strong></s></span>';
    console.log(removeTags(text));


Note the flaw: if < and > exist as text, everything in between may be considered a tag when it is not:

    function removeTags(text) {
      const regex = /(<\/?s>)|(<[^>]*>)/g; // Group 1 OR Group 2
      return text.replace(regex, (_, g1) => g1 || '');
    }

    let text = '<p> This is how you start a tag: `<` and this is how you end a tag: `>`</p>';
    console.log("But the regex fails:");
    console.log(removeTags(text));

An HTML parser can see that the brackets do not create a tag:

    <p> This is how you start a tag: `<` and this is how you end a tag: `>`</p>

If you want accurate parsing, use a proper HTML or XML parser rather than a regex.

You could try: /(<([^>s]+)>)|(<\/?(\w{2,})>)/gmi

The first part (<([^>s]+)>) matches any tag whose content does not contain the letter s.

The second part (<\/?(\w{2,})>) matches any attribute-free tag whose name has two or more characters.

Demo: https://regex101.com/r/AFlXam/1
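A quick way to check this pattern is to run it against a couple of inputs. This is only a sketch: the helper name `stripWithSecondRegex` and the second sample string are assumptions made for illustration, while the regex itself is copied from the answer above.

```javascript
// Regex copied verbatim from the answer above.
const keepSRegex = /(<([^>s]+)>)|(<\/?(\w{2,})>)/gmi;

// Hypothetical helper name, used only for this sketch.
function stripWithSecondRegex(html) {
  return html.replace(keepSRegex, '');
}

const simple = '<span>Span Text <s>S Text <strong>Strong Text</strong></s></span>';
console.log(stripWithSecondRegex(simple));
// Span Text <s>S Text Strong Text</s>

// Limitation: an opening tag with attributes that contain the letter "s"
// matches neither branch, so it is left in place.
const withAttrs = '<p class="x">Hi</p>';
console.log(stripWithSecondRegex(withAttrs));
// <p class="x">Hi
```

So the pattern seems fine for bare tags, but tags carrying attributes such as class or style can slip through.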

You can't reliably parse HTML with regex; see "RegEx match open tags except XHTML self-contained tags".

With some limitations, you can still use a regex to strip all HTML except s tags. This builds upon Chris Hamilton's answer, but avoids false positives such as ( a <= 20 && a > 2 ) because it is aware of tag names and attributes:

    function removeTags(text) {
      const regex = /(<\/?s>)|<\/?[a-zA-Z][a-zA-Z0-9]*(?: .*?)?>/g;
      return text.replace(regex, (_, g1) => g1 || '');
    }

    const text = '<h1>Demo:</h1> <p>Paragraph with <s>S text</s>, <b>bold stuff.</b></p> <p style="color: gray">Condition: <tt>(a <= 20 && a > 2)</tt></p>';
    console.log(removeTags(text));

Output:

Demo: Paragraph with <s>S text</s>, bold stuff. Condition: (a <= 20 && a > 2)

Explanation of regex:

  • (<\/?s>) -- literal <s> or </s>
  • | -- logical or
  • <\/?[a-zA-Z][a-zA-Z0-9]* -- start of tag, such as <h1 or <p
  • (?: .*?)? -- optional non-capture group starting with a space, and a non-greedy scan
  • > -- literal >
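The breakdown above can be sketched by assembling the same pattern from named pieces. The constant names and the helper `removeTagsExceptS` are hypothetical and exist only for this example; the pieces themselves are taken from the list.

```javascript
// Each piece mirrors one bullet of the explanation above.
const keepS = '(<\\/?s>)';                    // group 1: literal <s> or </s>
const tagStart = '<\\/?[a-zA-Z][a-zA-Z0-9]*'; // "<" or "</" followed by a tag name
const attributes = '(?: .*?)?';               // optional: a space, then a lazy scan
const tagEnd = '>';                           // literal >

const tagRegex = new RegExp(`${keepS}|${tagStart}${attributes}${tagEnd}`, 'g');

// Hypothetical helper: keep group 1 matches, drop every other tag.
function removeTagsExceptS(text) {
  return text.replace(tagRegex, (_, g1) => g1 || '');
}

console.log(removeTagsExceptS('<p style="color: gray">x</p>')); // x
console.log(removeTagsExceptS('<s>kept</s>'));                  // <s>kept</s>
console.log(removeTagsExceptS('a <= 20 && a > 2'));             // a <= 20 && a > 2

// Remaining false positive: "<" directly followed by a letter still
// looks like a tag start, so "<b && c>" is removed.
console.log(removeTagsExceptS('if a <b && c> d'));              // if a  d
```

The last call shows that the pattern narrows the false positives rather than eliminating them, which is consistent with the caveat that regex cannot reliably parse HTML.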

Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex


 