How can I get only tags of html?

Question

How can I get only the HTML tags with Node.js?

I have this:

<html>
<head>
Hi
</head>
<body>
<center id="fantastic">
Hi , hello
</center>
</body>
</html>

I want to delete "Hi" and "Hi , hello" and keep only the tags. I also want to remove id="fantastic". Any ideas? Is there a regular expression for this?

Answer 1

Assuming you have the source HTML in a JavaScript string, that it is legal HTML, that each tag sits on a single line (the "." in the regex does not match newlines), and that the HTML attributes don't contain ">" or "<" characters, this should work:

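var source = "your html here";

// match() returns null when the string contains no tags, so fall back to [].
var result = (source.match(/<.*?>/g) || []).map(function(item) {
    // Strip whitespace after "<", then strip the attributes.
    return item.replace(/<\s+/, "<").replace(/\s.*?(\/?>)$/, "$1");
}).join("");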

Working demo: http://jsfiddle.net/jfriend00/6q0gyugd/

This uses a regex to isolate just the HTML tags into an array, then uses .map() to iterate through that array, removing any whitespace after the opening "<" and any attributes from each tag, and finally joins the tags back into a single string of HTML.
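
To make the three steps concrete, here is a sketch that runs them one at a time on the HTML from the question. The intermediate variable names (tags, cleaned) are only for illustration, and the "|| []" guard is an addition so that input without any tags does not throw.

```javascript
// Sample input taken from the question.
var source = [
    '<html>',
    '<head>',
    'Hi',
    '</head>',
    '<body>',
    '<center id="fantastic">',
    'Hi , hello',
    '</center>',
    '</body>',
    '</html>'
].join("\n");

// Step 1: isolate every tag; the text between tags is simply not matched.
var tags = source.match(/<.*?>/g) || [];

// Step 2: drop whitespace after "<", then drop everything from the first
// whitespace up to the closing ">" or "/>" (that is, the attributes).
var cleaned = tags.map(function (item) {
    return item.replace(/<\s+/, "<").replace(/\s.*?(\/?>)$/, "$1");
});

// Step 3: join the cleaned tags back into one string.
var result = cleaned.join("");

console.log(tags[4]);   // <center id="fantastic">
console.log(result);    // <html><head></head><body><center></center></body></html>
```

Note that the line breaks between the original tags are lost, because only the matched tags are joined back together.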

For the most robust handling of any legal HTML, use an actual HTML parser (which can handle cases that a regex cannot) to parse the HTML, then walk the parsed tree and output just the tags.
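
As a middle ground that needs no extra dependency, the ">" inside attribute values limitation can be handled with a small hand-written scanner that tracks quotes. This is only a sketch and not a real HTML parser: the function name tagsOnly is hypothetical, comments, doctype and script/style bodies are not treated specially, and it still assumes that a literal "<" never appears in text. For anything beyond simple input, a parser library remains the safer choice.

```javascript
// Hypothetical helper: returns only the tags of `html`, without attributes.
// Not a full HTML parser; see the assumptions stated above.
function tagsOnly(html) {
    var out = "";
    var i = 0;
    while (i < html.length) {
        if (html[i] !== "<") { i++; continue; }   // skip text outside tags

        // Find the end of the tag, ignoring ">" inside quoted values.
        var j = i + 1;
        var quote = null;
        while (j < html.length) {
            var ch = html[j];
            if (quote) {
                if (ch === quote) quote = null;   // closing quote
            } else if (ch === '"' || ch === "'") {
                quote = ch;                       // opening quote
            } else if (ch === ">") {
                break;                            // real end of tag
            }
            j++;
        }
        if (j >= html.length) break;              // unterminated tag: stop

        var inner = html.slice(i + 1, j);
        // Keep an optional leading "/", then the tag name.
        var m = /^\s*(\/?)\s*([^\s\/>]+)/.exec(inner);
        if (m) {
            // Preserve a trailing "/" on self-closing tags such as <br />.
            var selfClosing = m[1] === "" && /\/\s*$/.test(inner);
            out += "<" + m[1] + m[2] + (selfClosing ? "/" : "") + ">";
        }
        i = j + 1;
    }
    return out;
}

console.log(tagsOnly('<a title="x > y" href="#">link</a><br />'));
// <a></a><br/>
```

Unlike the regex version, this scanner also copes with tags that span several lines, since it reads character by character instead of relying on ".".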

Answer 2

You can try using a library such as cheerio: https://github.com/cheeriojs/cheerio

2 answers

solution1
2 ACCEPTED 2015-05-23 20:04:13

solution2
0 2015-05-23 16:52:27
