
Extracting comments from HTML using Jsoup

Given this HTML source page, I am trying to extract the comments. For example, the first comment in this page is "Generated by the JDiff Javadoc doclet". I would like to extract this comment and all the others in the document.

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <HTML style="overflow:auto;"> <HEAD> <meta name="generator" content="JDiff v1.1.0"> <!-- Generated by the JDiff Javadoc doclet --> <!-- (http://www.jdiff.org) --> <meta name="description" content="JDiff is a Javadoc doclet which generates an HTML report of all the packages, classes, constructors, methods, and fields which have been removed, added or changed in any way, including their documentation, when two APIs are compared."> <meta name="keywords" content="diff, jdiff, javadiff, java diff, java difference, API difference, difference between two APIs, API diff, Javadoc, doclet"> <TITLE> All Removals Index </TITLE> <link href="../../../../assets/android-developer-docs.css" rel="stylesheet" type="text/css" /> <link href="../stylesheet-jdiff.css" rel="stylesheet" type="text/css" /> <noscript> <style type="text/css"> body{overflow:auto;} #body-content{position:relative; top:0;} #doc-content{overflow:visible;border-left:3px solid #666;} #side-nav{padding:0;} #side-nav .toggle-list ul {display:block;} #resize-packages-nav{border-bottom:3px solid #666;} </style> </noscript> <style type="text/css"> </style> </HEAD> <BODY class="gc-documentation" style="padding:12px;"> <a NAME="topheader"></a> <table summary="Index for All Differences" width="100%" class="jdiffIndex" border="0" cellspacing="0" cellpadding="0" style="padding-bottom:0;margin-bottom:0;"> <tr> <th class="indexHeader"> Filter the Index: </th> </tr> <tr> <td class="indexText" style="line-height:1.3em;padding-left:2em;"> <a href="alldiffs_index_all.html" xclass="hiddenlink">All Differences</a> <br> <b>Removals</b> <br> <A HREF="alldiffs_index_additions.html"xclass="hiddenlink">Additions</A> <br> <A HREF="alldiffs_index_changes.html"xclass="hiddenlink">Changes</A> </td> </tr> </table> <div id="indexTableCaption" style="background-color:#eee;padding:0 4px 0 4px;font-size:11px;margin-bottom:.5em;"> Listed as: <span 
style="color:#069"><strong>Added</strong></span>, <span style="color:#069"><strike>Removed</strike></span>, <span style="color:#069">Changed</span></font> </div> <!-- Field CATEGORY_GADGET --> <A NAME="C"></A> <br><font size="+2">C</font>&nbsp; <a href="#D"><font size="-2">D</font></a> <a href="#F"><font size="-2">F</font></a> <a href="#N"><font size="-2">N</font></a> <a href="#S"><font size="-2">S</font></a> <a href="#topheader"><font size="-2">TOP</font></a> <p><div style="line-height:1.5em;color:black"> <nobr><A HREF="android.content.Intent.html#android.content.Intent.CATEGORY_GADGET" class="hiddenlink" target="rightframe"><strike>CATEGORY_GADGET</strike></A> </nobr><br> <!-- Method dragViewToBottom --> <A NAME="D"></A> <br><font size="+2">D</font>&nbsp; <a href="#C"><font size="-2">C</font></a> <a href="#F"><font size="-2">F</font></a> <a href="#N"><font size="-2">N</font></a> <a href="#S"><font size="-2">S</font></a> <a href="#topheader"><font size="-2">TOP</font></a> <p><div style="line-height:1.5em;color:black"> <nobr><A HREF="android.test.TouchUtils.html#android.test.TouchUtils.dragViewToBottom_removed(android.test.ActivityInstrumentationTestCase, android.view.View, int)" class="hiddenlink" target="rightframe"><strike>dragViewToBottom</strike> (<code>ActivityInstrumentationTestCase, View, int</code>)</A></nobr><br> <!-- Method forkAndSpecialize --> <A NAME="F"></A> <br><font size="+2">F</font>&nbsp; <a href="#C"><font size="-2">C</font></a> <a href="#D"><font size="-2">D</font></a> <a href="#N"><font size="-2">N</font></a> <a href="#S"><font size="-2">S</font></a> <a href="#topheader"><font size="-2">TOP</font></a> <p><div style="line-height:1.5em;color:black"> <nobr><A HREF="dalvik.system.Zygote.html#dalvik.system.Zygote.forkAndSpecialize_removed(int, int, int[], boolean, int[][])" class="hiddenlink" target="rightframe"><strike>forkAndSpecialize</strike> (<code>int, int, int[], boolean, int[][]</code>)</A></nobr><br> <!-- Method forkSystemServer --> 
<nobr><A HREF="dalvik.system.Zygote.html#dalvik.system.Zygote.forkSystemServer_removed(int, int, int[], boolean, int[][])" class="hiddenlink" target="rightframe"><strike>forkSystemServer</strike> (<code>int, int, int[], boolean, int[][]</code>)</A></nobr><br> <!-- Constructor NetworkInfo --> <A NAME="N"></A> <br><font size="+2">N</font>&nbsp; <a href="#C"><font size="-2">C</font></a> <a href="#D"><font size="-2">D</font></a> <a href="#F"><font size="-2">F</font></a> <a href="#S"><font size="-2">S</font></a> <a href="#topheader"><font size="-2">TOP</font></a> <p><div style="line-height:1.5em;color:black"> <nobr><A HREF="android.net.NetworkInfo.html#android.net.NetworkInfo.ctor_removed(int)" class="hiddenlink" target="rightframe"><strike>NetworkInfo</strike> (<code>int</code>)</A></nobr>&nbsp;constructor<br> <!-- Method setButton --> <A NAME="S"></A> <br><font size="+2">S</font>&nbsp; <a href="#C"><font size="-2">C</font></a> <a href="#D"><font size="-2">D</font></a> <a href="#F"><font size="-2">F</font></a> <a href="#N"><font size="-2">N</font></a> <a href="#topheader"><font size="-2">TOP</font></a> <p><div style="line-height:1.5em;color:black"> <i>setButton</i><br> &nbsp;&nbsp;<nobr><A HREF="android.app.AlertDialog.html#android.app.AlertDialog.setButton_removed(java.lang.CharSequence, android.content.DialogInterface.OnClickListener)" class="hiddenlink" target="rightframe">type&nbsp;<strike> (<code>CharSequence, OnClickListener</code>)</strike>&nbsp;in&nbsp;android.app.AlertDialog </A></nobr><br> <!-- Method setButton --> &nbsp;&nbsp;<nobr><A HREF="android.app.AlertDialog.html#android.app.AlertDialog.setButton_removed(java.lang.CharSequence, android.os.Message)" class="hiddenlink" target="rightframe">type&nbsp;<strike> (<code>CharSequence, Message</code>)</strike>&nbsp;in&nbsp;android.app.AlertDialog </A></nobr><br> <script src="//www.google-analytics.com/ga.js" type="text/javascript"> </script> <script type="text/javascript"> try { var pageTracker = 
_gat._getTracker("UA-5831155-1"); pageTracker._setAllowAnchor(true); pageTracker._initData(); pageTracker._trackPageview(); } catch(e) {} </script> </BODY> </HTML> 

I found a way to remove the comments using Jsoup at: https://gist.github.com/jhy/491407

If you look at that code, you can probably adapt it into an extractComments method. I tried to implement this functionality and came up with this:

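import java.util.ArrayList;
import java.util.List;

import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Node;

// Recursively collects every comment node below the given node, in document order.
private List<Comment> getComments(Node node) {
    List<Comment> comments = new ArrayList<Comment>();
    for (Node child : node.childNodes()) {
        if (child instanceof Comment) {
            comments.add((Comment) child);
        } else {
            // Not a comment: descend into its children.
            comments.addAll(getComments(child));
        }
    }
    return comments;
}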

Example usage:

String page = "...."; //your page body
Document doc = Jsoup.parse(page);
List<Comment> comments = getComments(doc);
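
To get at the comment text itself (the part between the comment delimiters), each Comment has a getData() method. Below is a minimal, self-contained sketch of that, assuming Jsoup is on the classpath. It walks the tree with Jsoup's NodeVisitor instead of the hand-written recursion above, which is just an alternative and not necessarily better. The class name CommentTextDemo, the method name commentTexts, and the short sample page are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

// Hypothetical demo class; not part of Jsoup or of the answer above.
final class CommentTextDemo {

    /** Returns the trimmed text of every comment under root, in document order. */
    static List<String> commentTexts(Node root) {
        final List<String> texts = new ArrayList<>();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof Comment) {
                    // getData() is the raw text inside the comment delimiters.
                    texts.add(((Comment) node).getData().trim());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return texts;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page in the question.
        String page = "<html><head><!-- Generated by the JDiff Javadoc doclet -->"
                + "<title>All Removals Index</title></head>"
                + "<body><!-- Field CATEGORY_GADGET --><p>x</p></body></html>";
        Document doc = Jsoup.parse(page);
        for (String text : commentTexts(doc)) {
            System.out.println(text);
        }
    }
}
```

With the getComments method from the answer, the same text is available by calling getData() on each element of the returned list.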


 