Remove characters between top level html tags

Question

In summary, I am looking for a bulletproof solution to remove \n's from between HTML tags to make well-formed HTML instead of the quirks-mode string I am receiving.

Longer explanation: I have a string that contains HTML. There are \n strings between some of the top-level tags that I need to remove, BUT I must not remove \n's from inside tag content.

Example:

<p class='A'>AA A AAA</p>\n   \n  \n <p class='B'>BB BB \n BB\nBBB BB</p>

The \n's between the paras need to go, but the \n's in the para with class=B must stay. This is a trumped-up example: in the real world there are no predefined classes etc., I just get para tags with unpredictable content.

What did I try:

Simple string replacement is out because, of course, it also hits the \n's in the second para element, which must be retained.
I have looked for a regex solution but can't grok how to make one work as selectively as required. Even though regex is clever, I think it still sees a 'stream' rather than a 'structure'.
I tried loading the HTML into a div and pulling back that div's HTML, hoping that it would 'clean up' the inter-tag \n's, but no luck.
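
To make the first point concrete, here is a minimal sketch of why a global replace is not enough. It uses the example string from above with real newline characters; the variable names are only illustrative.

```javascript
// Example input from the question, with real newline characters.
const input = "<p class='A'>AA A AAA</p>\n   \n  \n <p class='B'>BB BB \n BB\nBBB BB</p>";

// Naive approach: strip every newline in the whole string.
const naive = input.replace(/\n/g, '');

// The newlines between the paragraphs are gone...
console.log(naive.includes('\n')); // false

// ...but so are the ones inside the second paragraph, which had to stay.
console.log(naive.includes('BB BB \n BB\nBBB BB')); // false
console.log(naive);
```

The replace has no notion of where a newline sits relative to the tags, so it cannot tell the two cases apart.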

Here is my current solution, using jQuery to do the clean-up. It only works for me because I know that there is no text I want to keep in between the top-level tags. It also cannot be made recursive to clean the grandchildren or lower levels, because any text would be lost.

    var dIn = $('#in');   // div to act as container to load subject html
    var dOut = $('#out'); // div to act as container for the clean-up
    var sOut = '';        // string to accumulate output
    var sIn = "<p class='A'>AA A\n AAA</p>\n \n \n <p class='B'>BB BB \n BB\nBBB BB<span>CC\nC</p>";

    $('#t1').val(sIn); // display starting string
    dIn.html(sIn);     // load input string into a div element

    dIn.children().each(function () { // walk the children of the container
      dOut.append($(this));           // append each child of input container to output container
      sOut = sOut + dOut.html();      // and yank the output container's html to give the tag-only content
      dOut.html('');                  // last, clear the output container for the next pass
    });

    // show the results
    $('#t2').val(sOut);

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
    <div id="in"></div>
    <div id="out"></div>
    <div id="info">
      <textarea id='t1' rows='10' cols='40'> </textarea>
      <textarea id='t2' rows='10' cols='40'> </textarea>
    </div>

Note: in case the comment is lost, this post explains why regex will not work. Props to @melpomene.

Answer 1

Regular expressions are a poor fit for HTML documents: elements can nest inside each other, so a pattern has to account for many different cases. The complexity grows quickly and tends to leave you with a nasty, buggy workaround, which to me just means a headache.

Use a parser instead: an actual DOM parser, not a regex-based one. The DOM solution below works only on first-level nodes, which is where it differs from the RegEx solution.

DOM solution:

    var html = `<p class='A'>AA A AAA</p> <p class='B' test required >BB BB BB BBB BB</p>`;
    var parser = new DOMParser();
    var doc = parser.parseFromString(html, "text/html");

    // Only immediate children of body
    var query = doc.evaluate('//body/*/following-sibling::text()', doc, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    for (let i = 0, length = query.snapshotLength; i < length; i++) {
      query.snapshotItem(i).textContent = query.snapshotItem(i).textContent.replace(/\n/g, "");
    }
    console.log(doc.body.innerHTML);
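
The same idea can also be sketched without XPath by walking the container's direct child nodes. This is only an illustration, not part of the original answer: `stripTopLevelNewlines` is a hypothetical helper name, and unlike the XPath query above it touches every direct text child, including one that precedes the first element.

```javascript
// Hypothetical helper (not from the answer): strip newlines from the text
// nodes that are direct children of `container`, leaving element content alone.
function stripTopLevelNewlines(container) {
  const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE
  for (const node of Array.from(container.childNodes)) {
    if (node.nodeType === TEXT_NODE) {
      node.textContent = node.textContent.replace(/\n/g, '');
    }
  }
  return container;
}

// Browser usage; skipped where DOMParser is not available (e.g. plain Node.js).
if (typeof DOMParser !== 'undefined') {
  const html = "<p class='A'>AA A AAA</p>\n   \n  \n <p class='B'>BB BB \n BB\nBBB BB</p>";
  const doc = new DOMParser().parseFromString(html, 'text/html');
  console.log(stripTopLevelNewlines(doc.body).innerHTML);
}
```

Because only `childNodes` of the container is visited, text inside the paragraphs is never reached, which is the property the question asks for.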

RegEx solution (not preferred; it looks for a closing tag and an opening tag that sit next to each other with only text between them):

    var html = `<p class='A'>AA A AAA</p> <p class='B' test required >BB BB BB BBB BB</p>`;
    console.log(html.replace(/(<\/\w+>)([^<>]+)(<\w+(?:\s+[\w-]+(?:\s*=\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'))?)*\s*>)/g, function(match, $1, $2, $3) {
      return $1 + $2.replace(/\n/g, '') + $3;
    }));
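
A short sketch of why the regex version is the less preferred one: the pattern has no notion of nesting depth, so it also fires between sibling tags inside an element. The helper name `stripBetweenTags` and the nested sample string are assumptions made for illustration; the pattern itself is the one from the answer.

```javascript
// Same pattern as in the RegEx solution, wrapped in a hypothetical helper.
const BETWEEN_TAGS = /(<\/\w+>)([^<>]+)(<\w+(?:\s+[\w-]+(?:\s*=\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'))?)*\s*>)/g;

function stripBetweenTags(html) {
  return html.replace(BETWEEN_TAGS, (match, close, text, open) =>
    close + text.replace(/\n/g, '') + open);
}

// Top-level case from the question: newlines between the paragraphs are
// removed, newlines inside the second paragraph are kept.
const topLevel = "<p class='A'>AA A AAA</p>\n   \n  \n <p class='B'>BB BB \n BB\nBBB BB</p>";
console.log(JSON.stringify(stripBetweenTags(topLevel)));

// Nested case: the newline between </b> and <i> is part of the paragraph's
// content, yet the pattern matches there too and removes it.
const nested = "<p>one <b>two</b>\n<i>three</i></p>";
console.log(JSON.stringify(stripBetweenTags(nested)));
```

The second call shows the difference from the DOM approach, which only ever looks at first-level nodes.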

1 answer

solution1
1 ACCEPTED 2018-02-10 13:33:08
