Parsing an HTML file using Java

How can I remove comments, along with their contents, from an HTML file using Java, where the comments are written like this:

<!--

Any ideas or help would be appreciated.

Take a look at JTidy, the Java port of HTML Tidy. You could override the print methods of the PPrint object to ignore the comment tags.

If you don't have valid XHTML (a comment posted here reminded me of this), you should first run JTidy over the HTML to tidy it up and make it valid XHTML.

See this for example code on JTidy.

Then I'd convert the HTML to a DOM instance.

Like so:

final DocumentBuilderFactory newFactory = DocumentBuilderFactory.newInstance();
final DocumentBuilder documentBuilder = newFactory.newDocumentBuilder();
Document document = documentBuilder.parse( new InputSource( new StringReader( string ) ) );

Then I'd navigate through the document tree and modify nodes as needed.
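The tree walk could look roughly like the sketch below. It assumes the input is already well-formed XHTML (for instance after tidying); the class and method names (DomCommentRemover, removeComments, parseWithoutComments) are made up for illustration and are not from the answer above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomCommentRemover {

    // Recursively removes every comment node below the given node.
    static void removeComments(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling before a possible removal.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.COMMENT_NODE) {
                node.removeChild(child);
            } else {
                removeComments(child);
            }
            child = next;
        }
    }

    // Parses well-formed XHTML (an assumption) and strips all comments from the tree.
    static Document parseWithoutComments(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xhtml)));
        removeComments(document);
        return document;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseWithoutComments(
                "<html><body><!-- gone --><p>kept</p></body></html>");
        Node body = doc.getElementsByTagName("body").item(0);
        System.out.println(body.getChildNodes().getLength()); // prints 1
    }
}
```

If the comments are never needed, calling setIgnoringComments(true) on the DocumentBuilderFactory before building the parser should make the manual walk unnecessary.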

Try a simple regex like:

String commentless = pageString.replaceAll("<!--[\\w\\W]*?-->", "");

Edit: to explain the regex:

  • <!-- matches the literal comment start
  • [\w\W] matches every character (even newlines) that can appear inside the comment
  • *? matches any number of those characters, but as few as possible (not greedy)
  • --> closes the comment
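To see the non-greedy behaviour on a multi-line comment, here is a small self-contained sketch. The class and method names (RegexCommentStripper, stripComments) are hypothetical, and the pattern is precompiled only as a convenience for repeated use.

```java
import java.util.regex.Pattern;

public class RegexCommentStripper {

    // Same pattern as in the answer: [\w\W] matches any character, including
    // newlines, and *? stops at the first closing "-->".
    private static final Pattern COMMENT = Pattern.compile("<!--[\\w\\W]*?-->");

    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>one</p><!-- first\ncomment --><p>two</p><!-- second -->";
        System.out.println(stripComments(page)); // prints <p>one</p><p>two</p>
    }
}
```

With a greedy * instead of *?, the single match would run from the first <!-- to the last -->, so <p>two</p> would be removed as well. Note that a regex has no notion of context, so it will also strip comment-like text inside scripts or attribute values.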
