Extracting comments from HTML using Jsoup

Given the HTML source page below, I am trying to extract the comments. For example, the first comment on the page is "Generated by the JDiff Javadoc doclet". I would like to extract this comment and all the others in the document.

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <HTML style="overflow:auto;"> <HEAD> <meta name="generator" content="JDiff v1.1.0"> <!-- Generated by the JDiff Javadoc doclet --> <!-- (http://www.jdiff.org) --> <meta name="description" content="JDiff is a Javadoc doclet which generates an HTML report of all the packages, classes, constructors, methods, and fields which have been removed, added or changed in any way, including their documentation, when two APIs are compared."> <meta name="keywords" content="diff, jdiff, javadiff, java diff, java difference, API difference, difference between two APIs, API diff, Javadoc, doclet"> <TITLE> All Removals Index </TITLE> <link href="../../../../assets/android-developer-docs.css" rel="stylesheet" type="text/css" /> <link href="../stylesheet-jdiff.css" rel="stylesheet" type="text/css" /> <noscript> <style type="text/css"> body{overflow:auto;} #body-content{position:relative; top:0;} #doc-content{overflow:visible;border-left:3px solid #666;} #side-nav{padding:0;} #side-nav .toggle-list ul {display:block;} #resize-packages-nav{border-bottom:3px solid #666;} </style> </noscript> <style type="text/css"> </style> </HEAD> <BODY class="gc-documentation" style="padding:12px;"> <a NAME="topheader"></a> <table summary="Index for All Differences" width="100%" class="jdiffIndex" border="0" cellspacing="0" cellpadding="0" style="padding-bottom:0;margin-bottom:0;"> <tr> <th class="indexHeader"> Filter the Index: </th> </tr> <tr> <td class="indexText" style="line-height:1.3em;padding-left:2em;"> <a href="alldiffs_index_all.html" xclass="hiddenlink">All Differences</a> <br> <b>Removals</b> <br> <A HREF="alldiffs_index_additions.html"xclass="hiddenlink">Additions</A> <br> <A HREF="alldiffs_index_changes.html"xclass="hiddenlink">Changes</A> </td> </tr> </table> <div id="indexTableCaption" style="background-color:#eee;padding:0 4px 0 4px;font-size:11px;margin-bottom:.5em;"> Listed as: <span 
style="color:#069"><strong>Added</strong></span>, <span style="color:#069"><strike>Removed</strike></span>, <span style="color:#069">Changed</span></font> </div> <!-- Field CATEGORY_GADGET --> <A NAME="C"></A> <br><font size="+2">C</font>&nbsp; <a href="#D"><font size="-2">D</font></a> <a href="#F"><font size="-2">F</font></a> <a href="#N"><font size="-2">N</font></a> <a href="#S"><font size="-2">S</font></a> <a href="#topheader"><font size="-2">TOP</font></a> <p><div style="line-height:1.5em;color:black"> <nobr><A HREF="android.content.Intent.html#android.content.Intent.CATEGORY_GADGET" class="hiddenlink" target="rightframe"><strike>CATEGORY_GADGET</strike></A> </nobr><br> <!-- Method dragViewToBottom --> <A NAME="D"></A> <br><font size="+2">D</font>&nbsp; <a href="#C"><font size="-2">C</font></a> <a href="#F"><font size="-2">F</font></a> <a href="#N"><font size="-2">N</font></a> <a href="#S"><font size="-2">S</font></a> <a href="#topheader"><font size="-2">TOP</font></a> <p><div style="line-height:1.5em;color:black"> <nobr><A HREF="android.test.TouchUtils.html#android.test.TouchUtils.dragViewToBottom_removed(android.test.ActivityInstrumentationTestCase, android.view.View, int)" class="hiddenlink" target="rightframe"><strike>dragViewToBottom</strike> (<code>ActivityInstrumentationTestCase, View, int</code>)</A></nobr><br> <!-- Method forkAndSpecialize --> <A NAME="F"></A> <br><font size="+2">F</font>&nbsp; <a href="#C"><font size="-2">C</font></a> <a href="#D"><font size="-2">D</font></a> <a href="#N"><font size="-2">N</font></a> <a href="#S"><font size="-2">S</font></a> <a href="#topheader"><font size="-2">TOP</font></a> <p><div style="line-height:1.5em;color:black"> <nobr><A HREF="dalvik.system.Zygote.html#dalvik.system.Zygote.forkAndSpecialize_removed(int, int, int[], boolean, int[][])" class="hiddenlink" target="rightframe"><strike>forkAndSpecialize</strike> (<code>int, int, int[], boolean, int[][]</code>)</A></nobr><br> <!-- Method forkSystemServer --> 
<nobr><A HREF="dalvik.system.Zygote.html#dalvik.system.Zygote.forkSystemServer_removed(int, int, int[], boolean, int[][])" class="hiddenlink" target="rightframe"><strike>forkSystemServer</strike> (<code>int, int, int[], boolean, int[][]</code>)</A></nobr><br> <!-- Constructor NetworkInfo --> <A NAME="N"></A> <br><font size="+2">N</font>&nbsp; <a href="#C"><font size="-2">C</font></a> <a href="#D"><font size="-2">D</font></a> <a href="#F"><font size="-2">F</font></a> <a href="#S"><font size="-2">S</font></a> <a href="#topheader"><font size="-2">TOP</font></a> <p><div style="line-height:1.5em;color:black"> <nobr><A HREF="android.net.NetworkInfo.html#android.net.NetworkInfo.ctor_removed(int)" class="hiddenlink" target="rightframe"><strike>NetworkInfo</strike> (<code>int</code>)</A></nobr>&nbsp;constructor<br> <!-- Method setButton --> <A NAME="S"></A> <br><font size="+2">S</font>&nbsp; <a href="#C"><font size="-2">C</font></a> <a href="#D"><font size="-2">D</font></a> <a href="#F"><font size="-2">F</font></a> <a href="#N"><font size="-2">N</font></a> <a href="#topheader"><font size="-2">TOP</font></a> <p><div style="line-height:1.5em;color:black"> <i>setButton</i><br> &nbsp;&nbsp;<nobr><A HREF="android.app.AlertDialog.html#android.app.AlertDialog.setButton_removed(java.lang.CharSequence, android.content.DialogInterface.OnClickListener)" class="hiddenlink" target="rightframe">type&nbsp;<strike> (<code>CharSequence, OnClickListener</code>)</strike>&nbsp;in&nbsp;android.app.AlertDialog </A></nobr><br> <!-- Method setButton --> &nbsp;&nbsp;<nobr><A HREF="android.app.AlertDialog.html#android.app.AlertDialog.setButton_removed(java.lang.CharSequence, android.os.Message)" class="hiddenlink" target="rightframe">type&nbsp;<strike> (<code>CharSequence, Message</code>)</strike>&nbsp;in&nbsp;android.app.AlertDialog </A></nobr><br> <script src="//www.google-analytics.com/ga.js" type="text/javascript"> </script> <script type="text/javascript"> try { var pageTracker = 
_gat._getTracker("UA-5831155-1"); pageTracker._setAllowAnchor(true); pageTracker._initData(); pageTracker._trackPageview(); } catch(e) {} </script> </BODY> </HTML> 

I found a way to remove comments with Jsoup (https://gist.github.com/jhy/491407), but I want to extract them instead.

If you look at the code in that gist, you should be able to adapt it into a method that extracts comments instead of removing them. I tried this and came up with the following:

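import java.util.ArrayList;
import java.util.List;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Node;

// Recursively collects every comment node beneath the given node.
private List<Comment> getComments(Node node) {
    List<Comment> comments = new ArrayList<Comment>();
    for (Node child : node.childNodes()) {
        if (child instanceof Comment) {
            comments.add((Comment) child);
        } else {
            // Comment nodes have no children, so only other node types are searched further.
            comments.addAll(getComments(child));
        }
    }
    return comments;
}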

Example usage:

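String page = "...."; // your page body
Document doc = Jsoup.parse(page);
List<Comment> comments = getComments(doc);
for (Comment comment : comments) {
    // getData() returns the text between the <!-- and --> delimiters
    System.out.println(comment.getData());
}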
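
As a quick cross-check on the number of comments found, the same page can be scanned without Jsoup. The sketch below is not part of the answer above and swaps in a different technique: a plain JDK regular expression over the raw markup. The class name RegexCommentCheck and the method extractComments are hypothetical names chosen for illustration. A regex does not understand HTML structure, so it would also match comment-like text inside script or style blocks; treat it only as a sanity check alongside the parser-based method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; uses only the JDK, not Jsoup.
class RegexCommentCheck {
    // Non-greedy match of <!-- ... -->; DOTALL lets a comment span several lines.
    private static final Pattern COMMENT =
            Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    // Returns the trimmed text of every comment found in the raw markup.
    static List<String> extractComments(String html) {
        List<String> comments = new ArrayList<>();
        Matcher matcher = COMMENT.matcher(html);
        while (matcher.find()) {
            comments.add(matcher.group(1).trim());
        }
        return comments;
    }

    public static void main(String[] args) {
        // A small fragment taken from the page in the question.
        String html = "<HEAD> <!-- Generated by the JDiff Javadoc doclet --> "
                + "<!-- (http://www.jdiff.org) --> </HEAD>";
        for (String comment : extractComments(html)) {
            System.out.println(comment);
        }
    }
}
```

If the count from this scan differs from the size of the list returned by getComments, the difference is worth inspecting, since the parser may have treated some of the markup differently from the raw text.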
