How to get only HTML tags?

Question

How can I get only the HTML tags with Node.js?

I have this:

<html>
<head>
Hi
</head>
<body>
<center id="fantastic">
Hi , hello
</center>
</body>
</html>

I want to delete "Hi" and "Hi , hello" and keep only the tags. I also want to remove the id="fantastic" attribute. Any ideas? Is there a regular expression for this?

Answer 1

Assuming you have the source HTML in a JavaScript string, that it is legal HTML, and that the HTML attributes don't contain ">" or "<" characters, this should work:

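var source = "your html here";

// match() returns null when nothing matches, so fall back to an empty array
var result = (source.match(/<.*?>/g) || []).map(function(item) {
    // strip whitespace after "<", then strip everything from the first
    // whitespace up to the closing ">" or "/>" (that is, the attributes)
    return item.replace(/<\s+/, "<").replace(/\s.*?(\/?>)$/, "$1");
}).join("");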

Working demo: http://jsfiddle.net/jfriend00/6q0gyugd/

This uses a regex to collect just the HTML tags into an array, then uses .map() to iterate over that array, removing any leading whitespace inside each tag and then any attributes, and finally joins the tags back into a single string of HTML.
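To make those steps concrete, here is a minimal sketch that wraps the same regex logic in a function and runs it on the HTML from the question. The names tagsOnly and sample are hypothetical, introduced only for illustration, and the same assumptions apply (legal HTML, no ">" or "<" inside attributes).

```javascript
// Hypothetical helper wrapping the regex approach described above.
function tagsOnly(source) {
    // 1. Collect everything that looks like a tag: "<", any characters, ">".
    //    match() returns null when nothing matches, hence the fallback.
    var tags = source.match(/<.*?>/g) || [];
    return tags.map(function (item) {
        return item
            .replace(/<\s+/, "<")            // 2. drop whitespace after "<"
            .replace(/\s.*?(\/?>)$/, "$1");  // 3. drop attributes, keep ">" or "/>"
    }).join("");                             // 4. join the tags back together
}

// The HTML from the question, one source line per array entry.
var sample = [
    "<html>", "<head>", "Hi", "</head>", "<body>",
    '<center id="fantastic">', "Hi , hello", "</center>", "</body>", "</html>"
].join("\n");

console.log(tagsOnly(sample));
// <html><head></head><body><center></center></body></html>
```

The text nodes and the id attribute are gone, and the tags come out as one line because join("") adds no separators.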

For the most robust handling of any legal HTML, use an actual HTML parser (which can be smarter than any regex) to parse the HTML, then walk the parsed tree and output just the tags.
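As a rough illustration of the state a regex lacks, the sketch below tracks whether it is inside a quoted attribute value, so a ">" inside quotes does not end the tag. This is not an HTML parser and not the approach recommended above: tagNamesOnly is a hypothetical, dependency-free helper that ignores comments, script and style bodies, and malformed markup. A real parser library is still the robust choice.

```javascript
// Hypothetical helper: a tiny quote-aware scanner, NOT a full HTML parser.
function tagNamesOnly(source) {
    var out = "";
    var i = 0;
    while (i < source.length) {
        if (source[i] !== "<") { i++; continue; }  // skip text between tags
        var j = i + 1;
        var quote = null;                          // open quote character, if any
        // Advance to the ">" that closes the tag, ignoring ">" inside quotes.
        while (j < source.length && (quote || source[j] !== ">")) {
            var ch = source[j];
            if (quote) {
                if (ch === quote) quote = null;    // closing quote
            } else if (ch === '"' || ch === "'") {
                quote = ch;                        // opening quote
            }
            j++;
        }
        if (j >= source.length) break;             // unterminated tag: stop
        var inner = source.slice(i + 1, j);        // text between "<" and ">"
        // Optional "/" for closing tags, followed by the tag name.
        var m = /^\s*(\/?)\s*([A-Za-z][A-Za-z0-9-]*)/.exec(inner);
        if (m) {
            var selfClosing = !m[1] && /\/\s*$/.test(inner);
            out += "<" + m[1] + m[2] + (selfClosing ? "/" : "") + ">";
        }
        i = j + 1;
    }
    return out;
}

console.log(tagNamesOnly('<a title="1 > 0 <b>">link</a><br />'));
// <a></a><br/>
```

For this input the regex version would emit a stray <b> taken from inside the title attribute, which is the case the answer's assumption about ">" and "<" rules out.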

Answer 2

You can try using a library like cheerio: https://github.com/cheeriojs/cheerio

Answer 1: score 2, accepted, 2015-05-23 20:04:13

Answer 2: score 0, 2015-05-23 16:52:27
