How do I use the Java library “HTML Parser” to remove all <style> tags?

I need to perform several operations on an HTML file, such as removing specific tags or deleting attributes. I decided to use HTML Parser, a Java library: http://htmlparser.sourceforge.net/

First of all, I want to remove all the style tags. I managed to get a NodeList containing all the style tags by doing this:

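import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
Parser parser = new Parser(url);
NodeList list = parser.parse(null); // null filter: no filtering, return every node
NodeList styles = list.extractAllNodesThatMatch(new TagNameFilter("STYLE"), true); // true: search recursively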

Now I don't know how to delete these style tags from the whole list of nodes. Do I have to go through the whole list?

After that, I want to be able to delete all the attributes inside the tags, or delete only the alt attributes, for example. Is there a method that does this automatically?

According to the documentation, the Parser returns a list of trees containing all of your HTML's nodes (think of the parser as the root of a big tree of Node objects, where each "level" of that tree is a NodeList).

You can iterate through the tree recursively, test each node's type against StyleTag, and delete matching nodes from the NodeList that holds them. Keep descending into the tree recursively until you have visited all of its nodes.
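The recursion described above could look roughly like the sketch below. It is written by hand and assumes the HTML Parser jar is on the classpath; the class name StyleStripper and both method names are made up for illustration, and the input is parsed from a string with Parser.createParser instead of a URL so the example is self-contained.

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.StyleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class StyleStripper {

    /** Removes every StyleTag from the list and, recursively, from all descendants. */
    static void removeStyleTags(NodeList nodes) {
        if (nodes == null) {
            return; // nodes without children return null from getChildren()
        }
        // Iterate backwards so that remove(i) never shifts an index we still have to visit.
        for (int i = nodes.size() - 1; i >= 0; i--) {
            Node node = nodes.elementAt(i);
            if (node instanceof StyleTag) {
                nodes.remove(i);
            } else {
                removeStyleTags(node.getChildren());
            }
        }
    }

    /** Parses an HTML string and returns its markup without the style elements. */
    static String stripStyles(String html) {
        try {
            Parser parser = Parser.createParser(html, "UTF-8");
            NodeList list = parser.parse(null); // null filter: return every node
            removeStyleTags(list);
            return list.toHtml();
        } catch (ParserException e) {
            // Wrapped only to keep the sketch easy to call; real code may prefer to propagate it.
            throw new IllegalStateException("could not parse HTML", e);
        }
    }
}
```

Removing the node from its parent's NodeList is enough here, because toHtml() on the top-level list re-serializes whatever children remain.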

NodeTreeWalker is your friend and can help you with the recursive tree traversal.

jsoup is another nice alternative with a simpler interface (see this other question).
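For comparison, here is a minimal jsoup sketch covering both parts of the question (dropping the style tags and dropping the alt attributes). It assumes the jsoup jar is on the classpath; the class name JsoupCleaner and the method clean are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupCleaner {

    /** Returns the document's HTML with all style elements and all alt attributes removed. */
    static String clean(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("style").remove();          // remove every <style> element from the tree
        doc.select("[alt]").removeAttr("alt"); // remove alt from every element that carries it
        return doc.html();
    }
}
```

Note that jsoup normalizes the document while parsing, so the returned markup may be indented or completed differently from the input.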
