I am reading in a .html file:
const htmlin = String(fs.readFileSync(inputHtml) || '');
const splitted = htmlin.split(/<pre.*>/);
splitted.shift();
const justPost = splitted.join('').split('</pre>');
justPost.pop();
but I am looking for a way to match all the text within the <pre> tags of a string such as
aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc
and also match the text outside. So that I can get two arrays:
['aaa ', ' bbb ', ' ccc']
and
[' xxx ', ' foo ']
How can I do this with a regex or some other method?
One way is to use the regex replace function with capturing groups.
<pre>(.*?)(?=<\/pre>)|(?:^|<\/pre>)(.*?)(?=$|<pre>)
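<pre>(.*?)(?=<\/pre>)
- Matches text between pre tags. (g1)
(?:^|<\/pre>)(.*?)(?=$|<pre>)
- Matches text outside of pre tags. (g2)
let str = `aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc`;
let inner = [];
let outer = [];
// replace() is used here only to visit every match; its return value is not needed
str.replace(/<pre>(.*?)(?=<\/pre>)|(?:^|<\/pre>)(.*?)(?=$|<pre>)/g, function (match, g1, g2) {
  // g1 is set when the first alternative matched (text inside <pre>)
  if (g1) {
    inner.push(g1.trim());
  }
  // g2 is set when the second alternative matched (text outside <pre>)
  if (g2) {
    outer.push(g2.trim());
  }
  return match;
});
console.log(outer); // [ 'aaa', 'bbb', 'ccc' ]
console.log(inner); // [ 'xxx', 'foo' ]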
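A possible alternative, sketched here under the assumption that the <pre> blocks are not nested: String.prototype.split keeps the text of a capturing group in its result, so a single split yields the outside and inside pieces in alternating positions, untrimmed as in the question. The helper name splitOnPre is hypothetical and not part of the original answer; [^>]* is added to tolerate attributes on <pre>, and [\s\S]*? so that the content may span several lines.

```javascript
// Hypothetical helper (not from the original answer): separates the text
// outside <pre> blocks from the text inside them.
function splitOnPre(html) {
  // Because the separator regex has one capturing group, split() inserts the
  // captured inner text between the outer pieces:
  // [outside, inside, outside, inside, ..., outside]
  const parts = html.split(/<pre\b[^>]*>([\s\S]*?)<\/pre>/);
  const outer = parts.filter((_, i) => i % 2 === 0); // even indexes: outside <pre>
  const inner = parts.filter((_, i) => i % 2 === 1); // odd indexes: inside <pre>
  return { outer, inner };
}

const { outer, inner } = splitOnPre('aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc');
console.log(outer); // [ 'aaa ', ' bbb ', ' ccc' ]
console.log(inner); // [ ' xxx ', ' foo ' ]
```

If the input starts or ends with a <pre> block, the first or last entry of outer is an empty string, which can be filtered out if it is not wanted.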
Instead of using a regex, you might use the DOM or a DOMParser.
For example, create a div and set its innerHTML property to your HTML. Then loop over the child nodes and read the innerHTML or the textContent.
For example:
let htmlString = `aaa <pre> xxx </pre> bbb <pre> foo </pre> ccc`,
  pre = [],
  text = [];
let div = document.createElement('div');
div.innerHTML = htmlString;
div.childNodes.forEach(x => {
  // Text nodes hold the content outside the <pre> elements
  if (x.nodeType === Node.TEXT_NODE) {
    text.push(x.textContent.trim());
  }
  // <pre> elements hold the content inside the tags
  if (x.nodeName === "PRE") {
    pre.push(x.innerHTML.trim());
  }
});
console.log(pre);
console.log(text);