Remove empty tags using RegEx

I want to delete empty tags such as <label></label> and <font> </font>, so that:

<label></label><form></form>
<p>This is <span style="color: red;">red</span> 
<i>italic</i>
</p>

will be cleaned to:

<p>This is <span style="color: red;">red</span> 
<i>italic</i>
</p>

I have this RegEx in JavaScript. It deletes the empty tags, but it also deletes this: "<i>italic</i></p>"

str=str.replace(/<[\S]+><\/[\S]+>/gim, "");

What am I missing?

Regex is not for HTML. If you're in JavaScript anyway, I'd encourage you to use jQuery DOM processing.

Something like:

$('*:empty').remove();

Alternatively:

$("*").filter(function() 
{ 
     // Select only elements whose HTML is empty or whitespace-only
     return $.trim($(this).html()).length === 0; 
}).remove();

Your character class is "not whitespace", which means "<i>italic</i></p>" will match. The first half of your regex will match "<(i>italic</i)>" and the second half "</(p)>". (I've used brackets to show what each [\S]+ matches.)

Change this:

/<[\S]+><\/[\S]+>/

To this:

/<[^/>][^>]*><\/[^>]+>/

Overall you should really be using a proper HTML processor, but if you're munging HTML soup this should suffice :)
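
As a quick check of the difference between the two patterns, the sketch below runs both against the fragment from the question. The variable names are illustrative only, and the g flag is added here so that every match is replaced.

```javascript
// Original pattern from the question: \S also matches ">" and "/",
// so a single [\S]+ can run across several tags.
const originalPattern = /<[\S]+><\/[\S]+>/g;

// Suggested pattern: the tag may not start with "/" and may not contain ">".
const fixedPattern = /<[^\/>][^>]*><\/[^>]+>/g;

const fragment = "<i>italic</i></p>";
const page = "<label></label><form></form><p>x</p>";

console.log(fragment.replace(originalPattern, "")); // the whole fragment is removed
console.log(fragment.replace(fixedPattern, ""));    // the fragment is left untouched
console.log(page.replace(fixedPattern, ""));        // only the empty pairs are removed
```

Note that the fixed pattern still does not check that the closing tag has the same name as the opening tag.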

All the regex answers only handle

<label></label>

but not whitespace-only cases such as

<label> </label>
<label>    </label>
<label>
</label> 

Try this pattern to match all of the above:

<[^/>]+>[ \n\r\t]*</[^>]+>
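
A minimal sketch of how this pattern might be used from JavaScript. The helper name stripBlankTags is hypothetical, and the loop is an addition of mine: it repeats the replacement so that a parent which only becomes empty after its child is removed is cleaned up as well.

```javascript
// Hypothetical helper: removes open/close pairs that contain only whitespace.
function stripBlankTags(html) {
  // Same pattern as above, with "/" escaped for a JavaScript regex literal.
  const blankPair = /<[^\/>]+>[ \n\r\t]*<\/[^>]+>/g;
  let previous;
  do {
    previous = html;
    html = html.replace(blankPair, "");
  } while (html !== previous); // repeat until nothing changes
  return html;
}

console.log(stripBlankTags("<label>\n</label><p>kept</p>"));
console.log(stripBlankTags("<div><span>  </span></div><p>kept</p>"));
```

Like the other regex answers, this does not verify that the closing tag matches the opening one, so it is only suitable for simple, well-formed fragments.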

You need /<[\S]+?><\/[\S]+?>/ -- the difference is the ?s after the +s, to match "as few as possible" (AKA "non-greedy match") non-space characters (though 1 or more), instead of the bare +s, which match "as many as possible" (AKA "greedy match").

Avoiding regular expressions altogether, as the other answer recommends, is also an excellent idea, but I wanted to point out the important greedy vs non-greedy distinction, which will serve you well in a huge variety of situations where regexes are warranted.
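
To make the greedy vs non-greedy distinction concrete, here is a small sketch with a sample string of my own. One caveat, as far as I can tell: a non-greedy quantifier still expands when a shorter match fails, and \S also matches ">", so on the fragment from the question the non-greedy pattern can still match across tags. Excluding ">" from the character class is needed as well.

```javascript
const sample = "<b>one</b><b>two</b>";

// Greedy: .+ grabs as much as it can, so the match runs to the last </b>.
const greedyMatch = sample.match(/<b>.+<\/b>/)[0];

// Non-greedy: .+? stops at the first place the rest of the pattern can match.
const lazyMatch = sample.match(/<b>.+?<\/b>/)[0];

console.log(greedyMatch); // <b>one</b><b>two</b>
console.log(lazyMatch);   // <b>one</b>

// Caveat: the non-greedy tag pattern still matches the question's fragment,
// because [\S]+? is allowed to grow past ">" until the whole pattern fits.
const lazyTagPattern = /<[\S]+?><\/[\S]+?>/;
const questionFragment = "<i>italic</i></p>";
console.log(lazyTagPattern.test(questionFragment)); // true
```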

I like MattMitchell's jQuery solution but here is another option using native JavaScript.

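function CleanChildren(elem)
{
    var children = elem.childNodes;

    // Walk backwards: childNodes is live, so removing a node shifts later indexes.
    for (var i = children.length - 1; i >= 0; i--)
    {
        var child = children[i];

        // Only consider elements (nodeType 1); leave text and comment nodes alone.
        if (child.nodeType !== 1)
            continue;

        if (child.hasChildNodes())
            CleanChildren(child);
        else
            elem.removeChild(child); // note: this also removes void elements such as <br>

    }
}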

Here's a modern native JavaScript solution, which is actually quite similar to the jQuery one from 2010. I adapted it from that answer for a project that I am working on, and thought I would share it here.

document.querySelectorAll("*:empty").forEach((x)=>{x.remove()});
  • document.querySelectorAll returns a NodeList, which is essentially an array-like list of all the elements that match the CSS selector given to it as an argument.

    • *:empty is a selector which selects all elements (* means "any element") that are empty (which is what :empty means).

      This will select any empty element within the entire document. If you only want to remove empty elements from within a certain part of the page (i.e. only those within some div element), you can add an id to that element and then use the selector #id *:empty, which means any empty element within the element with an id of id.

      Scoping the selector this way is almost certainly what you want. Technically some important tags (e.g. <meta> tags, <br> tags, <img> tags, etc.) are "empty", so without specifying a scope you will end up deleting some tags you probably care about.

  • forEach loops through every element in the resulting NodeList and runs the anonymous function (x)=>{x.remove()} on it. x is the current element in the list, and calling .remove() on it removes that element from the DOM.
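
One possible way to combine the scoping advice with protection for tags like <br> and <img> is sketched below. The names removeEmptyWithin and VOID_TAGS are hypothetical, as is the "content" id in the usage part; the tag list is the set of void elements from the HTML standard.

```javascript
// Void elements never have content, so :empty always matches them.
const VOID_TAGS = new Set([
  "AREA", "BASE", "BR", "COL", "EMBED", "HR", "IMG",
  "INPUT", "LINK", "META", "SOURCE", "TRACK", "WBR",
]);

// Hypothetical helper: removes empty elements under `root`, skipping void elements.
// Returns the number of elements removed.
function removeEmptyWithin(root) {
  let removed = 0;
  root.querySelectorAll("*:empty").forEach((el) => {
    if (VOID_TAGS.has(el.tagName.toUpperCase())) return; // keep <br>, <img>, ...
    el.remove();
    removed += 1;
  });
  return removed;
}

// Browser usage, assuming the page has an element with id "content" (hypothetical).
if (typeof document !== "undefined") {
  const scope = document.querySelector("#content");
  if (scope) removeEmptyWithin(scope);
}
```

Passing the root element in, rather than hard-coding document, keeps the scope explicit at the call site.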

Hopefully this helps someone. It's amazing to see how far JavaScript has come in just 8 years; from almost always needing a library to write something complex like this in a concise manner to being able to do so natively.

Edit

So, the method detailed above will work fine in most circumstances, but it has two issues:

  • Elements like <div> </div> are not treated as :empty (note the space in between). CSS Level 4 selectors fix this with the introduction of the :blank selector (which is like :empty except that it ignores whitespace), but currently only Firefox supports it (in vendor-prefixed form).
  • Self-closing tags are caught by :empty - and this will remain the case with :blank , too.

I have written a slightly larger function which deals with these two use cases:

document.querySelectorAll("*").forEach((x)=>{
    let tagName = "</" + x.tagName + ">";
    // slice(-n) takes the last n characters, i.e. the closing tag if there is one
    if (x.outerHTML.slice(-tagName.length).toUpperCase() == tagName
        && !/[^\s]/.test(x.innerHTML)) {
        x.remove();
    }
});

We iterate through every element on the page. We grab that element's tag name (for example, if the element is a div this would be DIV), and use it to construct a closing tag, e.g. </DIV>.

That tag is 6 characters long. We check whether the upper-cased last 6 characters of the element's HTML match it. If they do, we continue. If they don't, the element doesn't have a closing tag, and therefore must be self-closing. This is preferable to a list, because it means you don't have to update anything should a new self-closing tag get added to the spec.

Then, we check whether the contents of the element contain any non-whitespace character. /[^\s]/ is a RegEx. [] is a set in RegEx, and will match any character that appears inside it. If ^ is the first character, the set becomes negated: it will match any character that is NOT in the set. \s means whitespace: tabs, spaces, line breaks. So what [^\s] says is "any character that is not whitespace".

Putting that together, if the tag is not self-closing and its contents contain no non-whitespace character, then we remove it.


Of course, this is a bit bigger and less elegant than the previous one-liner. But it should work for essentially every case.
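
The intended check (a closing tag is present, and there is only whitespace inside) can be pulled out into a pure function so it can be tried without a DOM. This is only a sketch; the name isEmptyPairedElement is hypothetical.

```javascript
// Hypothetical helper: true when the element has a closing tag
// and its contents are empty or whitespace-only.
function isEmptyPairedElement(tagName, outerHTML, innerHTML) {
  const closingTag = ("</" + tagName + ">").toUpperCase();
  // slice(-n) returns the last n characters of the string.
  const hasClosingTag =
    outerHTML.slice(-closingTag.length).toUpperCase() === closingTag;
  // [^\s] matches any non-whitespace character; none may be present.
  const hasOnlyWhitespace = !/[^\s]/.test(innerHTML);
  return hasClosingTag && hasOnlyWhitespace;
}

console.log(isEmptyPairedElement("DIV", "<div> </div>", " "));  // true
console.log(isEmptyPairedElement("BR", "<br>", ""));            // false
console.log(isEmptyPairedElement("P", "<p>text</p>", "text"));  // false
```

In a browser, the three arguments would come from el.tagName, el.outerHTML and el.innerHTML for each element being inspected.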

This is an issue of greedy regex. Try this:

str=str.replace(/<[^>]+><\/[\S]+>/gim, "");

or

str=str.replace(/<[\S]+?><\/[\S]+>/gim, "");

In your regex, <[\S]+> matches <i>italic</i> and the <\/[\S]+> matches the </p>.

You can use this:

text = text.replace(/<[^/>][^>]*>\s*<\/[^>]+>/gim, "");

Found this on CodePen. It uses jQuery, but it does the job:

$('element').each(function() {
  if ($(this).text() === '') {
    $(this).remove();
  }
});

You will need to change 'element' to a selector for the part of the page where you want to remove empty tags. Do not point it at the whole document, because that runs into the problems described in Toastrackenigma's answer.

Remove empty tags with cheerio (this also removes images):

  $('*')
    .filter(function(index, el) {
      return (
        $(el)
          .text()
          .trim().length === 0
      )
    })
    .remove()

Remove empty tags with cheerio, but keep images:

  $('*')
    .filter(function(index, el) {
      return (
        el.tagName !== 'img' &&
        $(el).find(`img`).length === 0 &&
        $(el)
          .text()
          .trim().length === 0
      )
    })
    .remove()

<([^>]+)\s*>\s*<\/\1\s*>

<div>asdf</div>
<div></div> -- will match only this
<div></notdiv>
-- and this
<div  >  
    </div   >

Try it yourself at https://regexr.com/
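
The same test cases can also be checked in JavaScript. This is a small sketch with variable names of my own. Because the backreference \1 captures everything between "<" and ">", an empty tag with attributes such as <p class="x"></p> is not matched by this pattern.

```javascript
// \1 requires the closing tag to repeat what was captured in the opening tag.
const matchingEmptyPair = /<([^>]+)\s*>\s*<\/\1\s*>/g;

const testInput = [
  "<div>asdf</div>",
  "<div></div>",
  "<div></notdiv>",
  "<div  >  ",
  "    </div   >",
].join("\n");

const found = testInput.match(matchingEmptyPair);
console.log(found.length); // 2
console.log(found[0]);     // <div></div>

// An empty tag with attributes is not matched, so match() returns null.
console.log('<p class="x"></p>'.match(matchingEmptyPair));
```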
