
How do I remove all <br /> tags that aren't within <p> tags using regex?

I'm still getting used to using regex so I'm not entirely sure how to make this work.

I am not using jQuery, and the HTML is not the current document; rather, I'm getting it from another source as a string. I don't care about the <br /> tags that are outside of <p> tags, so I'd like to strip those out. I want to keep the ones that are within <p> tags to preserve their line breaks.

I need to change something like this:

<body><br /><p>hello<br />there</p><br /></body>

To this:

<body><p>hello<br />there</p></body>

What regex would I use to make this work?

Edit: More information: I'm trying to do this server-side with Node.js, so I do not have access to DOMParser. I am, however, using html-dom-parser. I'm stripping out these outer <br /> tags before I pass the string to that parser, to reduce the size of the resulting DOM tree object.

You can use DOMParser to parse the HTML content, then combine the :not() pseudo-class with > (the child combinator) to select every br tag whose direct parent is not a p tag, and remove those.

let html = `<body><br /> <p>hello<br />there </p><br /></body>`;

let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");

// Remove every <br> whose direct parent is not a <p>
doc.querySelectorAll(':not(p) > br').forEach(ele => ele.remove());

console.log(doc.body.outerHTML);

Parsing HTML using RegExp is a bad idea:

Using regular expressions to parse HTML: why not?

RegEx match open tags except XHTML self-contained tags
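
That said, if a string-level pre-pass is still wanted before handing the markup to another parser (as in the question), a minimal sketch might look like the following. The function name stripOuterBr is made up for illustration. It assumes simple, well-formed input: every <p> is explicitly closed, and no "<p" or "<br" text appears inside comments, scripts, or attribute values. It is not a general HTML parser and will misbehave on input that breaks those assumptions.

```javascript
// Hypothetical helper: removes <br> tags that are not inside a <p>...</p> block.
// Assumption: <p> elements are explicitly closed and never contain another <p>.
function stripOuterBr(html) {
  // The capture group makes split() keep each <p>...</p> block,
  // so paragraph blocks end up at the odd indices of the result.
  const paragraph = /(<p\b[^>]*>[\s\S]*?<\/p>)/gi;
  // Matches <br>, <br/>, <br /> and variants with attributes.
  const br = /<br\b[^>]*>/gi;

  return html
    .split(paragraph)
    .map((chunk, i) => (i % 2 === 1 ? chunk : chunk.replace(br, "")))
    .join("");
}

console.log(stripOuterBr("<body><br /><p>hello<br />there</p><br /></body>"));
```

Splitting on the paragraph blocks first avoids having to express "not inside a <p>" in a single pattern, which is where regex-only approaches usually go wrong.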


For Node.js, using the jsdom library, it might look like this:

const { JSDOM } = require("jsdom");

let html = `<body><br />
  <p>hello<br />there</p><br /></body>`;

const dom = new JSDOM(html);

// Remove every <br> whose direct parent is not a <p>
dom.window.document.querySelectorAll(':not(p) > br').forEach(ele => ele.remove());

console.log(dom.window.document.body.outerHTML);

UPDATE: If a br tag may be nested inside another element within the p tag, check for a p ancestor before removing it.

For example:

let html = `<body><br /> <p>hello<br />there<span><br/></span> </p><br /></body>`;

let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");

doc.querySelectorAll(':not(p) > br').forEach(ele => {
  // check for any p tag among the ancestors
  if (!ele.closest('p')) ele.remove();
});

console.log(doc.body.outerHTML);

Building on Pranav C Balan's answer:

The code <...>.querySelectorAll(':not(p) > br').forEach(ele => ele.remove()) is dangerous, because it also removes any <br> inside a <p> when that <br> is itself nested in a non-<p> tag.

let html = `<body><br> <p>hello <u>underline<br>underline</u><br>there </p><br></body>`;

let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");

// The <br> inside <u> has a non-<p> parent, so it is removed as well
doc.querySelectorAll(':not(p) > br').forEach(ele => ele.remove());

console.log(doc.body.outerHTML);
console.log(`This should've been: <body> <p>hello <u>underline<br>underline</u><br>there </p></body>`);

To make it work, we need to get all the <br> elements and check whether each one is inside a <p> element, either as a direct child or as a deeper descendant. With jQuery you would use the closest method. In vanilla JS we can use a method like this, as described here: PlainJS - Get closest element by selector

/** source: https://plainjs.com/javascript/traversing/get-closest-element-by-selector-39/ */

// matches polyfill
this.Element && function(ElementPrototype) {
  ElementPrototype.matches = ElementPrototype.matches ||
    ElementPrototype.matchesSelector ||
    ElementPrototype.webkitMatchesSelector ||
    ElementPrototype.msMatchesSelector ||
    function(selector) {
      var node = this,
        nodes = (node.parentNode || node.document).querySelectorAll(selector),
        i = -1;
      while (nodes[++i] && nodes[i] != node);
      return !!nodes[i];
    }
}(Element.prototype);

// closest polyfill
this.Element && function(ElementPrototype) {
  ElementPrototype.closest = ElementPrototype.closest ||
    function(selector) {
      var el = this;
      while (el.matches && !el.matches(selector)) el = el.parentNode;
      return el.matches ? el : null;
    }
}(Element.prototype);

let html = `<body><br> <p>hello <u>underline<br>underline</u><br>there </p><br></body>`;

let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");

doc.querySelectorAll(':not(p) > br').forEach(ele => {
  // only remove the <br> if it has no <p> ancestor at any level
  if (!ele.closest('p')) {
    ele.remove();
  }
});

console.log(doc.body.outerHTML);
console.log(`That should be: <body> <p>hello <u>underline<br>underline</u><br>there </p></body>`);
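
The ancestor check does not strictly need a DOM API. As a rough sketch for the Node.js case, the same rule ("drop a br unless some ancestor is a p") can be written as a recursive walk over a plain node tree. The node shape used here ({ type, name, children } for elements, { type, data } for text) and the function name removeOuterBr are assumptions for illustration; they resemble, but are not guaranteed to match, the output of a parser such as html-dom-parser.

```javascript
// Hypothetical tree walk: returns a copy of `nodes` without any <br> element
// that has no <p> ancestor. `insideP` records whether a <p> was seen on the way down.
function removeOuterBr(nodes, insideP = false) {
  return nodes
    // Outside a <p>, drop <br> elements; inside a <p>, keep everything.
    .filter((node) => insideP || !(node.type === "tag" && node.name === "br"))
    .map((node) => {
      if (!Array.isArray(node.children)) return node; // text nodes and the like
      const nowInsideP = insideP || (node.type === "tag" && node.name === "p");
      return { ...node, children: removeOuterBr(node.children, nowInsideP) };
    });
}

// Same structure as the <u>-nested example above (assumed node shape).
const br = () => ({ type: "tag", name: "br", children: [] });
const text = (data) => ({ type: "text", data });
const tree = [
  { type: "tag", name: "body", children: [
    br(),
    { type: "tag", name: "p", children: [
      text("hello "),
      { type: "tag", name: "u", children: [text("underline"), br(), text("underline")] },
      br(),
      text("there "),
    ] },
    br(),
  ] },
];

const cleaned = removeOuterBr(tree);
console.log(JSON.stringify(cleaned, null, 2));
```

The walk returns shallow copies rather than mutating the input, so any parent or sibling links that a real parser attaches to its nodes would not be updated by this sketch.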

Addendum:

If you need to put a space where each removed <br> was, so that a<br>b becomes a b rather than ab, you can use this function inside the forEach:

elm => {
    if (!elm.closest('p')) {
        // use the element's own document, so this also works on a parsed (non-global) document
        elm.parentNode.insertBefore(elm.ownerDocument.createTextNode(' '), elm);
        elm.remove();
    }
}

