I'm still getting used to regex, so I'm not entirely sure how to make this work. I am not using jQuery, and it is not the current document; rather, I'm getting HTML from another source as a string. I don't care about the <br /> tags that are outside of <p> tags, so I'd like to parse those out. I want to keep the ones that are within <p> tags to preserve their line breaks.
I need to change something like this:
<body><br /><p>hello<br />there</p><br /></body>
To this:
<body><p>hello<br />there</p></body>
What regex would I use to make this work?
Edit: More information: I'm trying to do this server-side with Node.js, so I do not have access to DOMParser. I am, however, using html-dom-parser. I'm stripping these outer <br /> tags before I pass the string to that parser, to reduce the resulting DOM tree object.
You can use DOMParser to parse the HTML content, then use the :not() pseudo-class selector to match every element that is not a p tag, combined with > (the child combinator) to get the br tags that are direct children of those elements. That way, br tags sitting directly inside a p are not matched.
let html = `<body><br /> <p>hello<br />there </p><br /></body>`;
let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");
// remove every br whose direct parent is not a p
doc.querySelectorAll(':not(p) > br').forEach(ele => ele.remove());
console.log(doc.body.outerHTML);
Parsing HTML using RegExp is a bad idea:
Using regular expressions to parse HTML: why not?
RegEx match open tags except XHTML self-contained tags
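If the string really has to be cleaned before any parser sees it, as the question describes, the core difficulty is that a single pattern cannot know whether a given br sits inside a p; some state has to be carried along. The following is only a minimal sketch of that idea, not a general HTML parser. The function name stripBrOutsideP is hypothetical, and it assumes tidy markup: every <p> is explicitly closed, and nothing tag-like appears inside comments, attribute values, or script blocks.

```javascript
// Hypothetical helper (not part of any library): removes <br> tags that are
// not inside a <p>...</p> pair, working directly on the raw string.
// Assumption: well-formed input where every <p> has a matching </p>.
function stripBrOutsideP(html) {
  let depth = 0; // number of <p> elements currently open

  // Match <p ...>, </p>, and any form of <br>, <br/>, <br />.
  // The lookahead keeps <pre>, <param>, <p-custom> etc. from matching.
  return html.replace(/<(\/?)(p|br)(?=[\s/>])[^>]*>/gi, (tag, slash, name) => {
    if (name.toLowerCase() === 'p') {
      depth = slash ? Math.max(0, depth - 1) : depth + 1;
      return tag; // <p> and </p> are always kept
    }
    return depth > 0 ? tag : ''; // keep a <br> only while inside a <p>
  });
}

const input = '<body><br /><p>hello<br />there</p><br /></body>';
console.log(stripBrOutsideP(input));
```

Because it only counts open p tags, a br nested in another element inside a p (for example inside a u) is kept as well. For input that is less tidy than assumed here, a DOM-based approach is likely to be more robust.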
In Node.js, using the jsdom library, it might look like this:
const { JSDOM } = require("jsdom");
let html = `<body><br />
<p>hello<br />there</p><br /></body>`;
const dom = new JSDOM(html);
dom.window.document.querySelectorAll(':not(p) > br').forEach(ele => ele.remove());
console.log(dom.window.document.body.outerHTML);
UPDATE: If a br tag may be nested inside another element within the p tag, check its ancestor elements before removing it.
For example:
let html = `<body><br /> <p>hello<br />there<span><br/></span> </p><br /></body>`;
let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");
doc.querySelectorAll(':not(p) > br').forEach(ele => {
  // check for any p tag among the ancestors
  if (!ele.closest('p')) ele.remove();
});
console.log(doc.body.outerHTML);
Based on the answer of Pranav C Balan:
The code <...>.querySelectorAll(':not(p) > br').forEach(ele => ele.remove()) is dangerous, because it also removes <br> elements that are inside a <p> whenever they are nested in another, non-<p> element.
let html = `<body><br> <p>hello <u>underline<br>underline</u><br>there </p><br></body>`;
let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");
doc.querySelectorAll(':not(p) > br').forEach(ele => ele.remove());
console.log(doc.body.outerHTML);
console.log(`This should've been: <body> <p>hello <u>underline<br>underline</u><br>there </p></body>`);
To make it work, we need to get all the <br> elements and check whether each one is inside a <p> element, either as a direct child or as a deeper descendant. With jQuery you would use the closest method. In vanilla JS, current browsers provide Element.closest natively; for older ones, a polyfill can be used as described here: PlainJS - Get closest element by selector
/** source: https://plainjs.com/javascript/traversing/get-closest-element-by-selector-39/ */

// matches polyfill
this.Element && function(ElementPrototype) {
  ElementPrototype.matches = ElementPrototype.matches ||
    ElementPrototype.matchesSelector ||
    ElementPrototype.webkitMatchesSelector ||
    ElementPrototype.msMatchesSelector ||
    function(selector) {
      var node = this,
        nodes = (node.parentNode || node.document).querySelectorAll(selector),
        i = -1;
      while (nodes[++i] && nodes[i] != node);
      return !!nodes[i];
    }
}(Element.prototype);

// closest polyfill
this.Element && function(ElementPrototype) {
  ElementPrototype.closest = ElementPrototype.closest ||
    function(selector) {
      var el = this;
      while (el.matches && !el.matches(selector)) el = el.parentNode;
      return el.matches ? el : null;
    }
}(Element.prototype);

let html = `<body><br> <p>hello <u>underline<br>underline</u><br>there </p><br></body>`;
let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");
doc.querySelectorAll(':not(p) > br').forEach(ele => {
  // only remove the br if no ancestor is a p
  if (!ele.closest('p')) {
    ele.remove();
  }
});
console.log(doc.body.outerHTML);
console.log(`That should be: <body> <p>hello <u>underline<br>underline</u><br>there </p></body>`);
If you need to put a space where each removed <br> was, so that a<br>b becomes a b rather than ab, you can use this function inside the forEach:
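elm => {
  if (!elm.closest('p')) {
    // create the text node from the element's own document, so this also
    // works where there is no global document (e.g. Node.js)
    elm.parentNode.insertBefore(elm.ownerDocument.createTextNode(' '), elm);
    elm.remove();
  }
}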
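For the Node.js case from the question's edit, where there is no DOMParser, the same "is any ancestor a p?" check can also be applied while walking an already parsed tree. The sketch below is an illustration only: the function name removeBrOutsideP is hypothetical, and the node shape (plain objects with name and children, lower-case tag names) is an assumption about what a parser such as html-dom-parser returns, so it should be verified against the actual output before use.

```javascript
// Assumed node shape (hypothetical, for illustration only):
//   { type: 'tag', name: 'p', children: [...] }  or  { type: 'text', data: '...' }
// Returns a new tree in which every br that has no p ancestor is dropped.
function removeBrOutsideP(nodes, insideP = false) {
  return nodes
    // outside a p, drop br nodes; inside a p, keep everything
    .filter(node => insideP || node.name !== 'br')
    .map(node => {
      if (!node.children) return node; // text nodes and other leaves
      return {
        ...node,
        // once a p has been entered, all descendants count as "inside p"
        children: removeBrOutsideP(node.children, insideP || node.name === 'p'),
      };
    });
}

// Hand-built tree for <body><br><p>hello <u>a<br>b</u><br>there</p><br></body>
const tree = [{
  type: 'tag', name: 'body', children: [
    { type: 'tag', name: 'br', children: [] },
    { type: 'tag', name: 'p', children: [
      { type: 'text', data: 'hello ' },
      { type: 'tag', name: 'u', children: [
        { type: 'text', data: 'a' },
        { type: 'tag', name: 'br', children: [] },
        { type: 'text', data: 'b' },
      ] },
      { type: 'tag', name: 'br', children: [] },
      { type: 'text', data: 'there' },
    ] },
    { type: 'tag', name: 'br', children: [] },
  ],
}];

console.log(JSON.stringify(removeBrOutsideP(tree), null, 2));
```

Passing the insideP flag down the recursion avoids walking back up through parent links for every br. Note that the sketch copies nodes with object spread, so any parent references or prototype methods on real parser nodes would not carry over to the result.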