JavaScript regex to remove unmatched closing HTML tags?

Question

I'm trying to remove excess closing tags in JavaScript, along with anything that follows them.

Here is a possible sample:

<div class="dummy">
    <div class="main">
        <div></div>
        <img src="a.jpg">
        <br>
        <img src="b.jpg" />
        <strong>
            <span>text</span>
        </strong>
    </div>
</div>
    ***excessive tags below***
</div>
</div>
<div class="footer">
    text
</div>
</body>
</html>

Any ideas about how to do this efficiently? The part I want to extract is always a div, but it may contain any number of nested divs, and I'm not sure how to deal with that.

A solution that takes multiple steps or uses callbacks is also fine, as long as it works.

Edit: My question is actually simpler than it seems. The sample always starts with the div that I want to extract, so all I need is to find its matching closing tag and discard everything that follows. I don't care about any other tags.

Answer 1

Don't use regex. As I understand it, you want to keep the div with class dummy and the div with class footer, so why not replace the body's contents with those two elements?

For example:

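// Grab the two elements to keep (assumes each class occurs at least once).
var dummy = document.getElementsByClassName('dummy')[0];
var footer = document.getElementsByClassName('footer')[0];

// Empty the body, then re-attach only the elements we kept.
var body = document.getElementsByTagName('body')[0];
body.innerHTML = '';
body.appendChild(dummy);
body.appendChild(footer);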

https://jsfiddle.net/1kq11ry2/

Answer 2

var data = '<div class="dummy"><div class="main"><div></div><img src="a.jpg"><br><div></div><img src="b.jpg" /><strong><span>text</span> </strong></div><div><div></div></div><div><div></div></div></div>***excessive tags below***</div></div><div class="footer">text</div></body></html>';

var starting_tags = [];
var closing_tags = [];

// Record the position of every opening <div tag.
var index;
var startIndex = 0;
var searchStrLen = 4; // length of '<div'
while ((index = data.indexOf('<div', startIndex)) > -1) {
    starting_tags.push(index);
    startIndex = index + searchStrLen;
}

// Record the position of every closing </div> tag.
startIndex = 0;
searchStrLen = 6; // length of '</div>'
while ((index = data.indexOf('</div>', startIndex)) > -1) {
    closing_tags.push(index);
    startIndex = index + searchStrLen;
}

// Count the leading closing tags that have a partner:
// the k-th </div> counts only if the k-th <div starts before it.
var nest_level = 0;
while (nest_level < starting_tags.length &&
       nest_level < closing_tags.length &&
       starting_tags[nest_level] < closing_tags[nest_level]) {
    nest_level++;
}

// Keep everything from the first <div through the last counted </div>.
var result = data.substring(starting_tags[0], closing_tags[nest_level - 1] + searchStrLen);

console.log(nest_level);
console.log(starting_tags);
console.log(closing_tags);
console.log(result);

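I was able to solve it. The code above collects the positions of the opening and closing div tags, counts how many closing tags have an opening tag to pair with, and cuts the string off after the last paired one, so the excess closing tags and everything after them are dropped.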

https://jsfiddle.net/89j7yakz/2/
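
For comparison, here is a minimal sketch of the "find the matching closing tag" idea from the question's edit, done with a running depth counter instead of two position arrays. It is not taken from either answer: the function name extractFirstDiv is hypothetical, and the pattern assumes that no attribute value contains a ">" character and that no "<div" text appears inside comments or scripts.

```javascript
// Hypothetical helper (not from the original answers): returns the first
// balanced <div>...</div> block in `html` and drops everything after it.
function extractFirstDiv(html) {
    // Matches an opening <div ...> tag or a closing </div> tag.
    // Assumption: attribute values never contain '>'.
    var tagPattern = /<(\/?)div\b[^>]*>/gi;
    var depth = 0;   // number of currently open divs
    var start = -1;  // index of the first opening <div, once seen
    var match;

    while ((match = tagPattern.exec(html)) !== null) {
        if (match[1] === '/') {
            if (start === -1) continue; // closing tag before any opening tag: skip it
            depth--;
            if (depth === 0) {
                // This closing tag matches the first opening tag.
                return html.substring(start, match.index + match[0].length);
            }
        } else {
            if (start === -1) start = match.index;
            depth++;
        }
    }
    return null; // no balanced div block found
}
```

Because the scan stops as soon as the depth returns to zero, any later sibling div (such as the footer in the sample) is never included, even if it is well formed.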

2 answers

solution1
2 2017-03-10 04:32:57

solution2
0 ACCEPTED 2017-03-10 05:19:03
