
Using JavaScript, how do I transform an HTML string into an array of HTML tags and text content?

I have an HTML string such as:

<p>
    <strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.
</p>

I want to convert this into a JavaScript array that looks like:

['<p>', '<strong>', '<em>', 'Lorem Ipsum ', '</em>', '</strong>', 'is simply dummy text of the printing ', '<em>', 'and', '</em>', ' typesetting industry.', '</p>']

That is, it should break the HTML string down into a flat array of tags and text content.

I have tried using DOMParser, following the approach from a related question:

const str = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const arr = [...doc.body.childNodes]
  .map(child => child.outerHTML || child.textContent);

However, this simply returns:

['<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>']

I have also searched for regex-based solutions, but haven't found any that break the string down exactly as I need.

Any suggestions?

Thanks

I'd make a recursive function to iterate over a given node and return an array of the text representation of its children:

const str = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');

// Walk the children of `node` depth-first, emitting text nodes as-is and
// wrapping each element's output in its opening and closing tags.
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is upper-case in HTML documents, so lower-case it.
      const tag = child.tagName.toLowerCase();
      output.push(`<${tag}>`);
      output.push(...parseNode(child));
      output.push(`</${tag}>`);
    }
  }
  return output;
};

console.log(parseNode(doc.body));
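If the markup is indented, as in the question's first snippet, the parser also produces whitespace-only text nodes between tags. Here is a minimal sketch of the same recursive walk that drops them. `flattenNode` and its `skipBlank` option are hypothetical names, and the numeric node-type constants are used so that the function also accepts plain objects shaped like DOM nodes:

```javascript
// Numeric values of Node.ELEMENT_NODE and Node.TEXT_NODE.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: depth-first walk that optionally skips text nodes
// containing only whitespace (indentation, newlines between tags).
function flattenNode(node, { skipBlank = true } = {}) {
  const output = [];
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === TEXT_NODE) {
      if (skipBlank && child.textContent.trim() === '') continue;
      output.push(child.textContent);
    } else if (child.nodeType === ELEMENT_NODE) {
      const tag = child.tagName.toLowerCase();
      output.push(`<${tag}>`, ...flattenNode(child, { skipBlank }), `</${tag}>`);
    }
  }
  return output;
}

// Browser-only usage; DOMParser is not defined in plain Node.js.
if (typeof DOMParser !== 'undefined') {
  const html = `<p>
    <strong><em>Lorem Ipsum </em></strong>is simply dummy text.
</p>`;
  const parsed = new DOMParser().parseFromString(html, 'text/html');
  console.log(flattenNode(parsed.body));
}
```

Text nodes that mix content and whitespace are kept unchanged, so a trailing newline inside the last text node would still appear in the result; trim it separately if that matters.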

If you need to keep attributes too, you could take the element's outerHTML and capture everything between the tag name and the first >:

const str = `<p style="color:green"><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');

const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is upper-case in HTML documents, so lower-case it.
      const tag = child.tagName.toLowerCase();
      // Capture the attribute text that follows the tag name, up to the first ">".
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      output.push(`<${tag}${attribs}>`);
      output.push(...parseNode(child));
      output.push(`</${tag}>`);
    }
  }
  return output;
};

console.log(parseNode(doc.body));
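The regular expression above stops at the first `>`, so an attribute value containing that character (for example `title="a>b"`) may be cut short, depending on how the browser serializes it. One possible alternative, sketched here on the assumption that rebuilding the tag from `element.attributes` is acceptable, is to serialize each attribute yourself. `escapeAttr` and `openTag` are hypothetical names:

```javascript
// Escape the characters that would break a double-quoted attribute value.
const escapeAttr = value =>
  value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');

// Hypothetical helper: build the opening tag from the element's attribute
// list instead of slicing outerHTML. Array.from accepts both a real
// NamedNodeMap and a plain array of { name, value } objects.
const openTag = element => {
  const attrs = Array.from(element.attributes)
    .map(({ name, value }) => ` ${name}="${escapeAttr(value)}"`)
    .join('');
  return `<${element.tagName.toLowerCase()}${attrs}>`;
};
```

Inside the recursive function, the opening tag could then be pushed with `output.push(openTag(child))`. Boolean attributes such as `readonly` come out as `readonly=""`, which is equivalent in HTML.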

If you need void (self-closing) elements such as <input> not to be given a closing tag, check whether the element's outerHTML contains </ :

const str = `<p style="color:green"><input readonly value="x"/><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;

const doc = new DOMParser().parseFromString(str, 'text/html');

const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is upper-case in HTML documents, so lower-case it.
      const tag = child.tagName.toLowerCase();
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      output.push(`<${tag}${attribs}>`);
      if (child.outerHTML.includes('</')) {
        // Not self-closing: emit the children and the closing tag.
        output.push(...parseNode(child));
        output.push(`</${tag}>`);
      }
    }
  }
  return output;
};

console.log(parseNode(doc.body));
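Checking for `</` in `outerHTML` can misfire if a void element carries that sequence inside an attribute value and the browser leaves it unescaped when serializing. A sketch of an alternative is to look the tag name up in a fixed list of void elements. `VOID_ELEMENTS` and `isVoidElement` are hypothetical names, and the list is assumed to match the void elements in the current HTML Standard:

```javascript
// Void elements (assumed to match the HTML Standard); these never take
// a closing tag or children.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical helper: true when the element must not get a closing tag.
// Only reads `tagName`, so it works on DOM elements and plain objects alike.
const isVoidElement = element =>
  VOID_ELEMENTS.has(element.tagName.toLowerCase());
```

In the recursive function, the condition `child.outerHTML.includes('</')` could then be written as `!isVoidElement(child)`.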


 