In summary, I am looking for a bulletproof solution to remove \\n's from between HTML tags to make well-formed HTML instead of the quirks-mode string I am receiving.
Longer explanation: I have a string that contains HTML. There are \\n strings between some of the top-level tags that I need to remove, BUT I must not remove \\n's from inside tag content.
Example:
<p class='A'>AA A AAA</p>\n \n \n <p class='B'>BB BB \n BB\nBBB BB</p>
The \\n's between the paras need to go, but the \\n's in the para with class=B must stay. This is a contrived example: in the real world there are no predefined classes etc.; I just get para tags with unpredictable content.
What did I try:
Here is my current solution, using jQuery to do the clean-up. It only works for me because I know there is no text I want to keep in between the top-level tags. It also cannot be made recursive to clean the grandchildren or lower, because any text would be lost.
var dIn = $('#in');   // div to act as container to load subject html
var dOut = $('#out'); // div to act as container for cleaning up
var sOut = '';        // string to accumulate output
var sIn = "<p class='A'>AA A\\n AAA</p>\\n \\n \\n <p class='B'>BB BB \\n BB\\nBBB BB<span>CC\\nC</p>";

$('#t1').val(sIn); // display starting string
dIn.html(sIn);     // load input string into a div element

dIn.children().each(function(){ // walk the children of the container
  dOut.append($(this));         // append each child of input container to output container
  sOut = sOut + dOut.html();    // and yank the output container's html to give the tag-only content
  dOut.html('');                // last, clear the output container for the next pass
})

// show the results
$('#t2').val(sOut);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="in"></div>
<div id="out"></div>
<div id="info">
  <textarea id='t1' rows='10' cols='40'> </textarea>
  <textarea id='t2' rows='10' cols='40'> </textarea>
</div>
Note: in case the comment is lost, this post explains why regex will not work. Props to @melpomene.
Regular expressions are a poor fit for HTML documents. Elements can nest inside one another, so a pattern has to account for many different cases; the complexity grows quickly and you end up with a nasty, buggy workaround and a headache.
Use a parser instead: a real DOM parser, not a regex-based one. The DOM solution below works only on first-level nodes (the immediate children of body), which is where it differs from the RegEx solution.
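To make the nesting problem concrete, here is a small sketch with a deliberately naive pattern. The pattern and the name `naiveStrip` are mine, made up for illustration; they are not taken from the post mentioned above. I am also assuming the \n's are real line-break characters. The pattern treats any text between a ">" and the next "<" as "between tags", so it cannot tell a gap between two paragraphs from the text inside one.

```javascript
// Deliberately naive (hypothetical helper): strip line breaks from ANY text
// that sits between a ">" and the next "<".
function naiveStrip(html) {
  return html.replace(/>([^<]*)</g, function (match, text) {
    return '>' + text.replace(/\n/g, '') + '<';
  });
}

var input = "<p class='A'>AA</p>\n\n<p class='B'>BB\nBB</p>";

// The line breaks between the paragraphs are gone, but so is the one
// inside the second paragraph, which was supposed to stay.
console.log(JSON.stringify(naiveStrip(input)));
// prints "<p class='A'>AA</p><p class='B'>BBBB</p>"
```

Telling the two cases apart requires knowing how deep you are in the tree, which is exactly what a parser tracks for you.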
DOM solution:
var html = `<p class='A'>AA A AAA</p> <p class='B' test required >BB BB BB BBB BB</p>`
var parser = new DOMParser();
var doc = parser.parseFromString(html, "text/html");
// Only immediate children of body
var query = doc.evaluate('//body/*/following-sibling::text()', doc, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (let i = 0, length = query.snapshotLength; i < length; i++) {
  // Strip line breaks from each top-level text node
  query.snapshotItem(i).textContent = query.snapshotItem(i).textContent.replace(/\n/g, "");
}
console.log(doc.body.innerHTML);
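The same idea can probably be written without XPath by walking `childNodes` directly. This is only a sketch under a few assumptions: the names `stripTopLevelNewlines` and `cleanHtml` are hypothetical, the \n's are taken to be real line-break characters (if they are the two-character sequence backslash + n, the pattern would need to be `/\\n/g` instead), and, unlike the XPath query, it also touches a text node that comes before the first element.

```javascript
// Node.TEXT_NODE has the value 3 in the DOM standard; it is spelled out here
// so the function does not depend on a global Node object.
var TEXT_NODE = 3;

// Hypothetical helper: strip line breaks from text nodes that are DIRECT
// children of `container`. Child elements are never descended into, so the
// text inside them is left untouched.
function stripTopLevelNewlines(container) {
  // Copy the (possibly live) node list before looping over it.
  Array.from(container.childNodes).forEach(function (node) {
    if (node.nodeType === TEXT_NODE) {
      node.nodeValue = node.nodeValue.replace(/[\r\n]+/g, '');
    }
  });
  return container;
}

// Hypothetical browser-only wrapper: parse, clean, serialize.
// DOMParser does not exist in plain Node.js, so this is defined but not called here.
function cleanHtml(html) {
  var doc = new DOMParser().parseFromString(html, 'text/html');
  return stripTopLevelNewlines(doc.body).innerHTML;
}
```

In a browser you would call `cleanHtml(yourString)`. Spaces between the tags are kept; only the line breaks are removed, which matches what the snippets in this answer do.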
RegEx solution (not preferred; it looks for a closing tag followed by an opening tag, with only text between them):
var html = `<p class='A'>AA A AAA</p> <p class='B' test required >BB BB BB BBB BB</p>`
console.log(html.replace(/(<\/\w+>)([^<>]+)(<\w+(?:\s+[\w-]+(?:\s*=\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'))?)*\s*>)/g, function(match, $1, $2, $3) {
  // $1 = closing tag, $2 = text between the tags, $3 = next opening tag
  return $1 + $2.replace(/\n/g, '') + $3;
}));