
Capturing everything between <pre> </pre> tags

I am reading in a .html file:

const htmlin = String(fs.readFileSync(inputHtml) || '');

const splitted = htmlin.split(/<pre.*>/);
splitted.shift();

const justPost = splitted.join('').split('</pre>');
justPost.pop();

but I am looking for a way to match all the text inside the <pre> tags of a string such as

aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc

and also match the text outside them, so that I get two arrays:

['aaa ', ' bbb ', ' ccc']

and

[' xxx ', ' foo ']

How can I do this with a regex or some other method?

One way is to use the regex replace function with capturing groups.

<pre>(.*?)(?=<\/pre>)|(?:^|<\/pre>)(.*?)(?=$|<pre>)
  • <pre>(.*?)(?=<\/pre>) - Matches the text between pre tags (g1).
  • (?:^|<\/pre>)(.*?)(?=$|<pre>) - Matches the text outside pre tags (g2).

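let str = `aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc`;
let inner = [];
let outer = [];

// replace() is used only to visit every match; the string itself comes back unchanged.
str.replace(/<pre>(.*?)(?=<\/pre>)|(?:^|<\/pre>)(.*?)(?=$|<pre>)/g, function (match, g1, g2) {
  if (g1) {
    inner.push(g1.trim()); // text inside <pre>...</pre>
  }
  if (g2) {
    outer.push(g2.trim()); // text outside the tags
  }
  return match;
});

console.log(outer); // [ 'aaa', 'bbb', 'ccc' ]
console.log(inner); // [ 'xxx', 'foo' ]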
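The pattern above uses `.`, which does not match newlines, and a bare `<pre>`, which does not match a tag with attributes. As a rough alternative sketch (not the answerer's method), `split` with a single capturing group returns the inside and outside pieces interleaved; the helper name `splitByPre` is made up here, and the pattern assumes `<pre>` elements are not nested.

```javascript
// Split on whole <pre>...</pre> elements. Because the pattern contains one
// capturing group, split() keeps the captured text in the result, so the
// array alternates: outside, inside, outside, inside, ..., outside.
function splitByPre(html) {
  // [^>]* allows attributes on the opening tag; [\s\S]*? spans newlines.
  const parts = html.split(/<pre\b[^>]*>([\s\S]*?)<\/pre>/i);
  const outer = [];
  const inner = [];
  parts.forEach((part, i) => (i % 2 === 0 ? outer : inner).push(part));
  return { outer, inner };
}

const { outer, inner } = splitByPre('aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc');
console.log(outer); // [ 'aaa ', ' bbb ', ' ccc' ]
console.log(inner); // [ ' xxx ', ' foo ' ]
```

Unlike the replace version, nothing is trimmed, so the output matches the arrays asked for in the question. If the input starts or ends with a `<pre>` element, `outer` will contain an empty string at that end.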

Instead of using a regex, you might use the DOM or a DOMParser.

Create a div and set its innerHTML property to your HTML. Then loop over the childNodes and read the innerHTML or the textContent of each node.

For example:

let htmlString = `aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc`,
  pre = [],
  text = [];

// Let the browser parse the markup inside a detached div.
let div = document.createElement('div');
div.innerHTML = htmlString;

div.childNodes.forEach(x => {
  if (x.nodeType === Node.TEXT_NODE) {
    text.push(x.textContent.trim()); // text outside <pre>
  }
  if (x.nodeName === "PRE") {
    pre.push(x.innerHTML.trim()); // content of a <pre> element
  }
});

console.log(pre);
console.log(text);
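The answer mentions a DOMParser but only shows the div variant. A minimal sketch of the DOMParser route might look like the following, assuming a browser environment where `DOMParser` is a global; the function name `splitByPreDom` is made up for illustration. Node.js, which the question's `fs.readFileSync` suggests, has no built-in DOMParser, so a DOM library would be needed there.

```javascript
// Hypothetical helper: parse the HTML into a separate document and walk the
// top-level nodes of its <body>. Assumes a global DOMParser (browsers).
function splitByPreDom(html) {
  const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE
  const doc = new DOMParser().parseFromString(html, 'text/html');
  const pre = [];
  const text = [];
  doc.body.childNodes.forEach((node) => {
    if (node.nodeType === TEXT_NODE) {
      text.push(node.textContent); // text between <pre> elements
    } else if (node.nodeName === 'PRE') {
      pre.push(node.textContent); // text inside a <pre> element
    }
  });
  return { pre, text };
}

// Usage, in a browser console:
// splitByPreDom('aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc');
```

Like the div example, this only looks at direct children of the body, so a `<pre>` nested inside another element would not be collected. Parsing with DOMParser keeps the markup out of the live page's document.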

