
How can I get only the tags of an HTML document?

How can I get only the HTML tags with Node.js?

I have this:

<html>
<head>
Hi
</head>
<body>
<center id="fantastic">
Hi , hello
</center>
</body>
</html>

I want to delete "Hi" and "Hi , hello" and keep only the tags, and I also want to remove the id="fantastic" attribute. Any ideas? Is there a regular expression for this?

Assuming you have the source HTML in a JavaScript string, that it is legal HTML, and that no attribute value contains a ">" or "<" character, this should work:

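var source = "your html here";

// match() returns null when there are no tags, so fall back to an empty array
var result = (source.match(/<.*?>/g) || []).map(function(item) {
    // drop whitespace after "<", then drop everything after the tag name
    return item.replace(/<\s+/, "<").replace(/\s.*?(\/?>)$/, "$1");
}).join("");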

Working demo: http://jsfiddle.net/jfriend00/6q0gyugd/

This uses a regex to collect just the HTML tags into an array, then uses .map() to walk that array, stripping any whitespace after the opening "<" and removing the attributes from each tag, and finally joins the results back into a single string of HTML.
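As a quick check of the steps above, here is the same match/map/join pipeline run against the HTML from the question. The function name tagsOnly and the variable sample are made up for this illustration, and the sketch still relies on the assumptions stated earlier (legal HTML, no "<" or ">" inside attribute values, each tag on a single line).

```javascript
// Hypothetical helper name; the logic is the match/map/join pipeline described above.
function tagsOnly(source) {
    // match() returns null when nothing matches, so fall back to an empty array.
    var tags = source.match(/<.*?>/g) || [];
    return tags.map(function (item) {
        return item
            .replace(/<\s+/, "<")             // remove whitespace right after "<"
            .replace(/\s.*?(\/?>)$/, "$1");   // remove everything after the tag name
    }).join("");
}

// The HTML from the question, one line per array entry.
var sample = [
    "<html>",
    "<head>",
    "Hi",
    "</head>",
    "<body>",
    '<center id="fantastic">',
    "Hi , hello",
    "</center>",
    "</body>",
    "</html>"
].join("\n");

console.log(tagsOnly(sample));
// <html><head></head><body><center></center></body></html>
```

The text nodes and the id attribute are gone, and the line breaks between tags disappear as well, because only the matched tags are joined back together.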


For the most robust handling of any legal HTML, use an actual HTML parser (which can handle cases that a regex cannot) to parse the HTML, then walk the parsed tree and output just the tags.
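One reason a parser copes better is that it tracks state, for example whether it is currently inside a quoted attribute value. The sketch below illustrates only that one piece of state. It is not an HTML parser: scanTags is a made-up name, and the scanner knows nothing about comments, script or style contents, or CDATA, so treat it as a demonstration of the idea rather than a replacement for a real parser.

```javascript
// Hypothetical helper: a hand-rolled scanner that tracks quote state.
// It ignores comments, <script>/<style> contents and CDATA sections.
function scanTags(source) {
    var out = "";
    var i = 0;
    while (i < source.length) {
        if (source[i] !== "<") { i++; continue; }   // skip text between tags
        var j = i + 1;
        var quote = null;                            // currently open quote character, if any
        while (j < source.length) {
            var ch = source[j];
            if (quote) {
                if (ch === quote) { quote = null; }  // closing quote
            } else if (ch === '"' || ch === "'") {
                quote = ch;                          // opening quote
            } else if (ch === ">") {
                break;                               // a ">" outside quotes ends the tag
            }
            j++;
        }
        if (j >= source.length) { break; }           // unterminated tag: stop scanning
        var tag = source.slice(i, j + 1);
        // Keep only an optional "/" and the tag name; anything else is dropped.
        var m = /^<\s*(\/?)([A-Za-z][A-Za-z0-9-]*)/.exec(tag);
        if (m) {
            var selfClosing = tag.slice(-2) === "/>";
            out += "<" + m[1] + m[2] + (selfClosing ? "/>" : ">");
        }
        i = j + 1;
    }
    return out;
}

console.log(scanTags('<a title="x > <b>">link</a>'));
// <a></a>
```

With the regex approach, the same input would produce a stray <b> tag, because the ">" inside the title attribute ends the first match early and the quoted "<b>" is then picked up as a tag of its own.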

You can try using a library like cheerio: https://github.com/cheeriojs/cheerio
