Java Jsoup search

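I have some questions about Jsoup. I have the following parent page. I want to look for certain tag names on the HTML page and, if I find them, follow the link inside each one and search for more tag names there. But first, I just want the tag names printed to the console. Here is my HTML page.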

    <div id="main">


<div class="box">
<!-- box / title -->
<div class="title">
<h5>
<input class="q_filter_box" id="q_filter" size="15" type="text" name="filter" placeholder="quick filter..." value=""/> 
<span class="groups_breadcrumbs">
<a href="/">Home</a>
&raquo; "samia" with
</span>
<span id="repo_count">0</span> repositories
</h5>
<ul class="links">
<li>

</li>
</ul>
</div>
<!-- end box / title -->
<div class="table">
<div id='groups_list_wrap' class="yui-skin-sam">
<table id="groups_list">
<thead>
<tr>
<th class="left"><a href="#">Section Name</a></th>
<th class="left"><a href="#">ID_Description</a></th>
</tr>
</thead>

<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/Export"><i class="icon-folder-close"></i> Export</a>
</div>
</td>
<td>samia/Export group</td>
</tr>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/Links"><i class="icon-folder-close"></i> Links</a>
</div>
</td>
<td>samia/Links group</td>
</tr>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/Platforms"><i class="icon-folder-close"></i> Platforms</a>
</div>
</td>
<td>samia/Platforms group</td>
</tr>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/LargeSml"><i class="icon-folder-close"></i> LargeSml</a>
</div>
</td>
<td>samia/LargeSml group</td>
</tr>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/Processes"><i class="icon-folder-close"></i> Processes</a>
</div>
</td>
<td>samia/Processes group</td>
</tr>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/Tills"><i class="icon-folder-close"></i> Tills</a>
</div>
</td>
<td>Tills für samia</td>
</tr>
</table>

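First, I want to see/display: Export, Links, Platforms, LargeSml

Second, I want to go into Export and search for more tags, and so on.

So far I have the following code, but it doesn't seem to work.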

    Document doc = Jsoup.connect("http://************").cookies(loginCookies).get();

    for (Element table : doc.select("groups_list")) {
        for (Element row : table.select("tr")) {
            Elements tds = row.select("td");

            System.out.println(tds.get(0).text());
        }
    }

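This is my second step. On the page, I want to walk through the entries of the table groups_list: "Export", "Links", "Platforms", and so on. Then, on each of those sub-pages (for example "Export"), I want to search for all links ending in Doc. Those links are generated by JavaScript. The script follows.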

      <script>
    var data = {"totalRecords": 2, "sort": "name", "startIndex": 0, "dir": "asc", "records": [{"raw_name": "samia/export/Citydata", "last_changeset": "\n  <div>\n      <pre><a title=\"xn00761:\n\nAdded tag V2.11.d50.mkt.001 for changeset 56e10a4864ff\" class=\"tooltip\" href=\"/samia/export/Citydata/changeset/f602409eba261d749d23dc75551b2959425dfa8d\">r17:f602409eba26</a></pre>\n  </div>\n", "atom": "\n    <a title=\"Subscribe to samia/export/Citydata atom feed\" href=\"/samia/export/Citydata/feed/atom?api_key=e214ebea2335318bee1460a1fd33725ab3e1002e\"><i class=\"icon-rss-sign\"  style=\"color: #fa9b39\"></i></a>\n", "owner": "owner", "rss": "\n    <a title=\"Subscribe to samia/export/Citydata rss feed\" href=\"/samia/export/Citydata/feed/rss?api_key=e214ebea2335318bee1460a1fd33725ab3e1002e\"><i class=\"icon-rss-sign\" style=\"color: #fa9b39\"></i></a>\n", "name": "\n    \n  <div style=\"white-space: nowrap; }\">\n        <a href=\"/samia/export/Citydata\">\n\n        <span title=\"Mercurial repository\"><i class=\"icon-hg\" style=\"color: #316293; font-size: 14px;\"></i></span>\n\n      <span style=\"margin: 0px 8px 0px 8px\"></span>\n    Citydata\n    </a>\n  </div>\n", "last_rev_raw": 17, "state": "\n  <div>\n        <div class=\"btn btn-mini btn-success disabled\">Created</div>\n  </div>\n", "menu": "\n  <ul class=\"menu_items hidden\">\n\n    <li style=\"border-top:1px solid #003367;margin-left:18px;padding-left:-99px\"></li>\n    <li>\n       <a title=\"Summary\" href=\"/samia/export/Citydata\">\n       <span class=\"icon\">\n           <i class=\"icon-file-text\"></i>\n       </span>\n       <span>Summary</span>\n       </a>\n    </li>\n    <li>\n       <a title=\"Changelog\" href=\"/samia/export/Citydata/changelog\">\n       <span class=\"icon\">\n           <i class=\"icon-list-alt\"></i>\n       </span>\n       <span>Changelog</span>\n       </a>\n    </li>\n    <li>\n       <a title=\"Files\" href=\"/samia/export/Citydata/files/tip/\">\n       <span 
class=\"icon\">\n           <i class=\"icon-file-alt\"></i>\n       </span>\n       <span>Files</span>\n       </a>\n    </li>\n    <li>\n       <a title=\"Fork\" href=\"/samia/export/Citydata/fork\">\n       <span class=\"icon\">\n           <i class=\"icon-code-fork\"></i>\n       </span>\n       <span>Fork</span>\n       </a>\n    </li>\n  </ul>\n", "desc": "HDB Marktdatenimport", "last_change": "\n  <span class=\"tooltip\" date=\"2014-08-21 18:49:50\" title=\"Thu, 21 Aug 2014 18:49:50\">6 days and 19 hours ago</span>\n"}, {"raw_name": "samia/export/CitydataDoc", "last_changeset": "\n  <div>\n      <pre><a title=\"xn01606 &amp;lt;owner;gt;:\n\nChangedokumentation\" class=\"tooltip\" href=\"/samia/export/CitydataDoc/changeset/9ed1679c7a35b76e1402b540cee38000461fdfdd\">r0:9ed1679c7a35</a></pre>\n  </div>\n", "atom": "\n    <a title=\"Subscribe to samia/export/CitydataDoc atom feed\" href=\"/samia/export/CitydataDoc/feed/atom?api_key=e214ebea2335318bee1460a1fd33725ab3e1002e\"><i class=\"icon-rss-sign\"  style=\"color: #fa9b39\"></i></a>\n", "owner": "xn00761 (Stefan Kortmann)", "rss": "\n    <a title=\"Subscribe to samia/export/CitydataDoc rss feed\" href=\"/samia/export/CitydataDoc/feed/rss?api_key=e214ebea2335318bee1460a1fd33725ab3e1002e\"><i class=\"icon-rss-sign\" style=\"color: #fa9b39\"></i></a>\n", "name": "\n    \n  <div style=\"white-space: nowrap; }\">\n        <a href=\"/samia/export/CitydataDoc\">\n\n        <span title=\"Mercurial repository\"><i class=\"icon-hg\" style=\"color: #316293; font-size: 14px;\"></i></span>\n\n      <span style=\"margin: 0px 8px 0px 8px\"></span>\n    CitydataDoc\n    </a>\n  </div>\n", "last_rev_raw": 0, "state": "\n  <div>\n        <div class=\"btn btn-mini btn-success disabled\">Created</div>\n  </div>\n", "menu": "\n  <ul class=\"menu_items hidden\">\n\n    <li style=\"border-top:1px solid #003367;margin-left:18px;padding-left:-99px\"></li>\n    <li>\n       <a title=\"Summary\" href=\"/samia/export/CitydataDoc\">\n      
 <span class=\"icon\">\n           <i class=\"icon-file-text\"></i>\n       </span>\n       <span>Summary</span>\n       </a>\n    </li>\n    <li>\n       <a title=\"Changelog\" href=\"/samia/export/CitydataDoc/changelog\">\n       <span class=\"icon\">\n           <i class=\"icon-list-alt\"></i>\n       </span>\n       <span>Changelog</span>\n       </a>\n    </li>\n    <li>\n       <a title=\"Files\" href=\"/samia/export/CitydataDoc/files/tip/\">\n       <span class=\"icon\">\n           <i class=\"icon-file-alt\"></i>\n       </span>\n       <span>Files</span>\n       </a>\n    </li>\n    <li>\n       <a title=\"Fork\" href=\"/samia/export/CitydataDoc/fork\">\n       <span class=\"icon\">\n           <i class=\"icon-code-fork\"></i>\n       </span>\n       <span>Fork</span>\n       </a>\n    </li>\n  </ul>\n", "desc": "HDB Marktdatenimport (Dokumentation)", "last_change": "\n  <span class=\"tooltip\" date=\"2014-04-25 11:03:45\" title=\"Fri, 25 Apr 2014 11:03:45\">4 months and 3 days ago</span>\n"}]};
    var myDataSource = new YAHOO.util.DataSource(data);
    myDataSource.responseType = YAHOO.util.DataSource.TYPE_JSON;
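
Because the repository list on this sub-page is built by JavaScript, Jsoup selectors will not find those links in the parsed DOM; the names have to be read out of the script text itself. A rough sketch of that idea follows, assuming the script text has already been obtained (for example with Jsoup's `Element.data()` on the `script` element) and that every record carries a `"raw_name"` entry as in the excerpt above. A regular expression is used only to keep the example free of extra dependencies; a real JSON parser would likely be more robust. `DocRepoFinder` and `docRepos` are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocRepoFinder {

    // Matches "raw_name": "<value>" and captures the value.
    // Assumption: the value itself contains no escaped quotes.
    private static final Pattern RAW_NAME =
        Pattern.compile("\"raw_name\"\\s*:\\s*\"([^\"]*)\"");

    // Returns every raw_name found in the script text that ends with "Doc".
    static List<String> docRepos(String scriptText) {
        List<String> result = new ArrayList<>();
        Matcher m = RAW_NAME.matcher(scriptText);
        while (m.find()) {
            String name = m.group(1);
            if (name.endsWith("Doc")) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Heavily shortened stand-in for the script shown in the question.
        String script = "var data = {\"records\": ["
            + "{\"raw_name\": \"samia/export/Citydata\", \"last_rev_raw\": 17}, "
            + "{\"raw_name\": \"samia/export/CitydataDoc\", \"last_rev_raw\": 0}]};";
        System.out.println(docRepos(script));
    }
}
```

Each returned name is a site-relative path, so prefixing it with the site root should give the URL of the corresponding repository page.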

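Then, once I have gone through all of these ...Doc links, I want to check whether there are further links with a specific label. Below you can see that I want to follow every link whose text is "r0:9ed1679c7a35".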

<div class="box" style="margin-top: -20px">
<div class="title">
    <div class="breadcrumbs">
        <a href="/samia/Export/CitydataDoc/changelog">Latest changes</a>
    </div>
</div>
<div class="table">
    <div id="shortlog_data">
        <table class="table_disp">
<tr>
    <th class="left">Revision</th>
    <th class="left">Commit message</th>
    <th class="left">Age</th>
    <th class="left">Author</th>
    <th class="left">Refs</th>
</tr>
<tr class="parity0">
    <td>
      <div>
        <div class="changeset-status-container">
        </div>
        <pre><a href="/samia/Export/CitydataDoc/files/9ed1679c7a35b76e1402b540cee38000461fdfdd/">r0:9ed1679c7a35</a></pre>
     </div>
    </td>
    <td>
        <a class="message-link" href="/samia/Export/CitydataDoc/changeset/9ed1679c7a35b76e1402b540cee38000461fdfdd">Changedokumentation</a>
    </td>
    <td><span class="tooltip" title="Fri, 25 Apr 2014 11:03:45">
                  4 months and 3 days ago</span>
    </td>
    <td title="owner;">owner</td>
    <td>
         <div class="tagtag" title="Tag tip">
             <a href="/samia/Export/CitydataDoc/files/9ed1679c7a35b76e1402b540cee38000461fdfdd/">tip</a>
         </div>
         <div class="branchtag" title="Branch default">
             <a href="/samia/Export/CitydataDoc/changelog?branch=default">default</a>
         </div>
    </td>
</tr>

Once I am on the page behind a link named "r0:9ed1679c7a35", I get the following HTML. Here I want to find Changedokumentation.docx.

<div class="browser-body">
    <table class="code-browser">
        <thead>
            <tr>
                <th>Name</th>
                <th>Size</th>
                <th>Mimetype</th>
                <th>Last Revision</th>
                <th>Last modified</th>
                <th>Last committer</th>
            </tr>
        </thead>

        <tbody id="tbody">

            <tr class="parity0">
                 <td>

    <a class="browser-file ypjax-link" href="/samia/Export/CitydataDoc/files/9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation_P0702_HDB_20140318.docx">Changedokumentation.docx</a>
                 </td>
                 <td>
                     133.5 KiB
                 </td>
                 <td>
                      application/vnd.openxmlformats-officedocument.wordprocessingml.document
                 </td>
                 <td>
                         <div class="tooltip" title="Changedokumentation">
                          <pre>r0:9ed1679c7a35</pre>
                         </div>
                 </td>
                 <td>
                         <span class="tooltip" title="Fri, 25 Apr 2014 11:03:45">
                        4 months and 3 days ago</span>
                 </td>
                 <td>
                         <span title="owner">
                        owner
                        </span>
                 </td>
            </tr>
        </tbody>
        <tbody id="tbody_filtered" style="display:none">
        </tbody>
    </table>
</div>
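
In this listing the file links carry the class `browser-file`, so a selector on that class combined with the end of the `href` may be enough to pick out the .docx entries. A small sketch under that assumption, run against a shortened copy of the table above; `FileBrowserDemo` and `docxNames` are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FileBrowserDemo {

    // Shortened copy of the file browser table from the question.
    static final String HTML =
        "<table class=\"code-browser\"><tbody id=\"tbody\">"
        + "<tr class=\"parity0\"><td>"
        + "<a class=\"browser-file ypjax-link\" href=\"/samia/Export/CitydataDoc/files/"
        + "9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation_P0702_HDB_20140318.docx\">"
        + "Changedokumentation.docx</a></td><td>133.5 KiB</td></tr>"
        + "</tbody></table>";

    // Returns the visible names of all links in table.code-browser whose href ends in .docx.
    static List<String> docxNames(Document doc) {
        List<String> names = new ArrayList<>();
        for (Element a : doc.select("table.code-browser a.browser-file[href$=.docx]")) {
            names.add(a.text());
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(docxNames(doc));
    }
}
```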

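Finally, when I reach the "end page", I want to first list the name (so I know the file exists) and also have the option to download the file. I can download it by clicking "Download as Raw". Here is the code of that page.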

<div id="body" class="codeblock">
<div class="code-header">
    <div class="stats">
        <div class="left img"><i class="icon-file"></i></div>
        <div class="left item"><pre class="tooltip" title="Fri, 25 Apr 2014 11:03:45"><a href="/samia/Export/CityDataDoc/changeset/9ed1679c7a35b76e1402b540cee38000461fdfdd">r0:9ed1679c7a35</a></pre></div>
        <div class="left item"><pre>133.5 KiB</pre></div>
        <div class="left item last"><pre>application/vnd.openxmlformats-officedocument.wordprocessingml.document</pre></div>
        <div class="buttons">
            <a class="btn btn-mini" href="/samia/Export/CityDataDoc/annotate/9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation.docx">Show Annotation</a>
          <a class="btn btn-mini" href="/samia/Export/CityDataDoc/raw/9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation.docx">Show as Raw</a>
          <a class="btn btn-mini" href="/samia/Export/CityDataDoc/rawfile/9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation.docx">Download as Raw</a>
            <a class="btn btn-mini disabled tooltip" href="#" title="Editing binary files not allowed">Edit</a>
            <a class="btn btn-mini btn-danger" href="/samia/Export/CityDataDoc/delete/default/Changedokumentation.docx#edit">Delete</a>
        </div>
    </div>
    <div class="author">
        <div class="gravatar">
            <img alt="gravatar" src="/images/user16.png"/>
        </div>
        <div title="owner" class="user">owner</div>
    </div>
    <div class="commit">Changedokumentation</div>
</div>
<div class="code-body">
       <div style="padding:5px">
       Binary file (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
       </div>
</div>
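
The "Download as Raw" button is an ordinary link whose `href` contains `/rawfile/`, so one possible approach is to select that link, resolve it to an absolute URL, and fetch the bytes with Jsoup itself. The sketch below is untested against the real site: the base URI `http://example.invalid/` is a placeholder, `RawDownloadDemo`, `rawFileUrl` and `download` are invented names, and `loginCookies` stands for the cookie map from the code earlier in the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class RawDownloadDemo {

    // Shortened copy of the button bar from the question.
    static final String HTML =
        "<div class=\"buttons\">"
        + "<a class=\"btn btn-mini\" href=\"/samia/Export/CityDataDoc/raw/"
        + "9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation.docx\">Show as Raw</a>"
        + "<a class=\"btn btn-mini\" href=\"/samia/Export/CityDataDoc/rawfile/"
        + "9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation.docx\">Download as Raw</a>"
        + "</div>";

    // Returns the absolute URL behind the rawfile link, or null if the page has none.
    static String rawFileUrl(Document doc) {
        Element link = doc.select("div.buttons a[href*=/rawfile/]").first();
        return link == null ? null : link.absUrl("href");
    }

    // Fetches the file and writes it to target.
    // Not called from main because it needs the real site and a valid login.
    static void download(String url, Map<String, String> loginCookies, Path target)
            throws IOException {
        Connection.Response response = Jsoup.connect(url)
            .cookies(loginCookies)
            .ignoreContentType(true) // the body is a .docx, not HTML
            .maxBodySize(0)          // 0 removes Jsoup's body size limit
            .execute();
        Files.write(target, response.bodyAsBytes());
    }

    public static void main(String[] args) {
        // Placeholder base URI; a document fetched with Jsoup.connect(...).get() has a real one.
        Document doc = Jsoup.parse(HTML, "http://example.invalid/");
        System.out.println(rawFileUrl(doc));
    }
}
```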

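I know that is a lot of code. But I hope it is easy to understand :-)

Jsoup uses CSS-like selectors. The way you are using it, it searches for an element whose tag name is groups_list, which of course it cannot find. Instead, you need to look for the table with the id groups_list. I think this should do it: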

doc.select("table#groups_list")

Put that in the outer loop and it should at least print the td elements.

Since an id should be unique within an HTML document, you can also write:
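
For reference, here is a sketch of the loop from the question with the corrected selector. It parses a shortened copy of the table instead of connecting to the real site, and `GroupsListDemo` and `firstColumn` are names made up for this example. One extra assumption goes beyond the selector fix: the header row contains only `th` cells, so `row.select("td")` is empty there and `tds.get(0)` would throw an exception; the sketch skips such rows.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class GroupsListDemo {

    // Shortened copy of the table from the question.
    static final String HTML =
        "<table id=\"groups_list\">"
        + "<thead><tr><th class=\"left\"><a href=\"#\">Section Name</a></th>"
        + "<th class=\"left\"><a href=\"#\">ID_Description</a></th></tr></thead>"
        + "<tr><td><div><a href=\"/samia/Export\"><i class=\"icon-folder-close\"></i> Export</a></div></td>"
        + "<td>samia/Export group</td></tr>"
        + "<tr><td><div><a href=\"/samia/Links\"><i class=\"icon-folder-close\"></i> Links</a></div></td>"
        + "<td>samia/Links group</td></tr>"
        + "</table>";

    // Returns the text of the first td of every data row in table#groups_list.
    static List<String> firstColumn(Document doc) {
        List<String> names = new ArrayList<>();
        for (Element table : doc.select("table#groups_list")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                if (tds.isEmpty()) {
                    continue; // header row: it has th cells but no td
                }
                names.add(tds.get(0).text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        for (String name : firstColumn(doc)) {
            System.out.println(name);
        }
    }
}
```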

doc.select("#groups_list")

or

doc.select("table[id=groups_list]")

Finally, if you do not want to use Jsoup's CSS selector engine, you can use a Jsoup method to access the element directly by its id:

doc.getElementById("groups_list");
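
To go on and visit the linked pages, the relative `href` values have to be turned into absolute URLs first. A minimal sketch of one way to do that with `getElementById` and `absUrl`, assuming the document has a base URI (Jsoup sets one when the page is fetched with `Jsoup.connect`; here a placeholder `http://example.invalid/` is passed to `Jsoup.parse`). `LinkCollector` and `groupLinks` are illustrative names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkCollector {

    // Shortened copy of the table from the question.
    static final String HTML =
        "<table id=\"groups_list\">"
        + "<tr><th><a href=\"#\">Section Name</a></th></tr>"
        + "<tr><td><a href=\"/samia/Export\"> Export</a></td></tr>"
        + "<tr><td><a href=\"/samia/Links\"> Links</a></td></tr>"
        + "</table>";

    // Maps link text to absolute URL for every link inside a td of #groups_list.
    static Map<String, String> groupLinks(Document doc) {
        Map<String, String> links = new LinkedHashMap<>();
        Element table = doc.getElementById("groups_list"); // null if no such id exists
        if (table == null) {
            return links;
        }
        for (Element a : table.select("td a[href]")) {
            links.put(a.text(), a.absUrl("href")); // resolved against the base URI
        }
        return links;
    }

    public static void main(String[] args) {
        // Placeholder base URI, used only because this example parses a string.
        Document doc = Jsoup.parse(HTML, "http://example.invalid/");
        groupLinks(doc).forEach((name, url) -> System.out.println(name + " -> " + url));
    }
}
```

Each URL could then be fetched the same way as the parent page, for example with `Jsoup.connect(url).cookies(loginCookies).get()`.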
