Capturing everything between <pre> </pre> tags

Question

I am reading in a .html file:

const htmlin = String(fs.readFileSync(inputHtml) || '');

const splitted = htmlin.split(/<pre.*>/);
splitted.shift();

const justPost = splitted.join('').split('</pre>');
justPost.pop();

But I am looking for a way to match all the text within the <pre> tags in

aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc

and also match the text outside, so that I can get two arrays:

['aaa ', ' bbb ', ' ccc']

and

[' xxx ', ' foo ']

How can I do this with a regex or some other method?

Answer 1

One way is to use the regex replace function with capturing groups.

<pre>(.*?)(?=<\/pre>)|(?:^|<\/pre>)(.*?)(?=$|<pre>)

<pre>(.*?)(?=<\/pre>) - Matches the text between pre tags (g1).
(?:^|<\/pre>)(.*?)(?=$|<pre>) - Matches the text outside pre tags (g2).

let str = `aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc`;
let inner = [];
let outer = [];

// g1 holds text inside <pre>...</pre>, g2 holds text outside it.
let op = str.replace(/<pre>(.*?)(?=<\/pre>)|(?:^|<\/pre>)(.*?)(?=$|<pre>)/g, function (match, g1, g2) {
  if (g1) { inner.push(g1.trim()); }
  if (g2) { outer.push(g2.trim()); }
  return match;
});

console.log(outer);
console.log(inner);
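As an alternative sketch (not part of the original answer), String.prototype.split keeps captured groups in its result, so splitting on the whole <pre> block with one capturing group yields an array that alternates outside text and inside text. The helper name splitPre is hypothetical, and the pattern assumes the <pre> blocks are not nested; [\s\S] is used so that multi-line blocks also match, and [^>]* tolerates attributes on the opening tag.

```javascript
// Hypothetical helper (not from the original answer): split on <pre> blocks.
// A capturing group in the separator makes split() keep the captured text,
// so the result alternates: outside, inside, outside, inside, ...
function splitPre(html) {
  const parts = html.split(/<pre\b[^>]*>([\s\S]*?)<\/pre>/i);
  const outer = [];
  const inner = [];
  parts.forEach((part, i) => {
    // Even indexes are text outside the tags, odd indexes are the captures.
    (i % 2 === 0 ? outer : inner).push(part);
  });
  return { outer, inner };
}

const { outer, inner } = splitPre('aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc');
console.log(outer); // [ 'aaa ', ' bbb ', ' ccc' ]
console.log(inner); // [ ' xxx ', ' foo ' ]
```

Nothing is trimmed here, so the arrays match the ones asked for in the question. If the input starts or ends with a <pre> block, the corresponding outer entry is an empty string.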

Answer 2

Instead of using a regex, you might use the DOM or a DOMParser.

Create a div and set its innerHTML property to your HTML. Then loop over the childNodes and read the innerHTML or the textContent of each node.

For example:

let htmlString = `aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc`,
  pre = [],
  text = [];

let div = document.createElement('div');
div.innerHTML = htmlString;

div.childNodes.forEach(x => {
  if (x.nodeType === Node.TEXT_NODE) {
    text.push(x.textContent.trim());
  }
  if (x.nodeName === "PRE") {
    pre.push(x.innerHTML.trim());
  }
});

console.log(pre);
console.log(text);
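The same loop can be wrapped in a small function that takes any container node, which may help because the question reads the file in Node, where there is no global document. This is only a sketch: collectPre is a hypothetical name, the function only looks at direct children of the root, and in Node it assumes you obtain the root from some DOM library (for instance jsdom), which is not shown here.

```javascript
// Numeric value of Node.TEXT_NODE in the DOM standard, hard-coded so the
// function does not depend on a global Node object.
const TEXT_NODE = 3;

// Hypothetical helper: sort the direct children of `root` into two arrays.
function collectPre(root) {
  const pre = [];
  const text = [];
  for (const node of root.childNodes) {
    if (node.nodeType === TEXT_NODE) {
      text.push(node.textContent); // text outside <pre>, untrimmed
    } else if (node.nodeName === 'PRE') {
      pre.push(node.innerHTML);    // content of a <pre> element, untrimmed
    }
  }
  return { pre, text };
}

// Browser usage (assumes a global DOMParser):
// const doc = new DOMParser().parseFromString(htmlString, 'text/html');
// console.log(collectPre(doc.body));
```

Unlike the snippet above, this version does not call trim(), so the surrounding whitespace shown in the question's arrays is preserved.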

2 answers

solution1
2 2019-02-24 04:35:52

solution2
0 2019-02-24 11:42:00
