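Java Jsoup search
I have some questions about Jsoup. I have the following parent page. I want to look for certain tag names on the HTML page and, if they are found, follow the link behind each tag name and search for more tag names there. But first, I want the tag names printed to the console. This is my HTML page.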
<div id="main">
<div class="box">
<!-- box / title -->
<div class="title">
<h5>
<input class="q_filter_box" id="q_filter" size="15" type="text" name="filter" placeholder="quick filter..." value=""/>
<span class="groups_breadcrumbs">
<a href="/">Home</a>
» "samia" with
</span>
<span id="repo_count">0</span> repositories
</h5>
<ul class="links">
<li>
</li>
</ul>
</div>
<!-- end box / title -->
<div class="table">
<div id='groups_list_wrap' class="yui-skin-sam">
<table id="groups_list">
<thead>
<tr>
<th class="left"><a href="#">Section Name</a></th>
<th class="left"><a href="#">ID_Description</a></th>
</tr>
</thead>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/Export"><i class="icon-folder-close"></i> Export</a>
</div>
</td>
<td>samia/Export group</td>
</tr>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/Links"><i class="icon-folder-close"></i> Links</a>
</div>
</td>
<td>samia/Links group</td>
</tr>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/Platforms"><i class="icon-folder-close"></i> Platforms</a>
</div>
</td>
<td>samia/Platforms group</td>
</tr>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/LargeSml"><i class="icon-folder-close"></i> LargeSml</a>
</div>
</td>
<td>samia/LargeSml group</td>
</tr>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/Processes"><i class="icon-folder-close"></i> Processes</a>
</div>
</td>
<td>samia/Processes group</td>
</tr>
<tr>
<td>
<div style="white-space: nowrap">
<a href="/samia/Tills"><i class="icon-folder-close"></i> Tills</a>
</div>
</td>
<td>Tills für samia</td>
</tr>
</table>
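First, I want to see/print: Export, Links, Platforms, LargeSml
Second, I want to go into Export and search for more tags, and so on.
So far I have the following code, but it does not seem to work.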
Document doc = Jsoup.connect("http://************").cookies(loginCookies).get();
for (Element table : doc.select("groups_list")) {
    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        System.out.println(tds.get(0).text());
    }
}
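Here is my second step. On the page, I want to go through the entries of the groups_list table: Export, Links, Platforms, and so on. On each of these subpages (for example Export), I want to search for all links ending in Doc. These are all inside JavaScript. The script follows.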
<script>
var data = {"totalRecords": 2, "sort": "name", "startIndex": 0, "dir": "asc", "records": [{"raw_name": "samia/export/Citydata", "last_changeset": "\n <div>\n <pre><a title=\"xn00761:\n\nAdded tag V2.11.d50.mkt.001 for changeset 56e10a4864ff\" class=\"tooltip\" href=\"/samia/export/Citydata/changeset/f602409eba261d749d23dc75551b2959425dfa8d\">r17:f602409eba26</a></pre>\n </div>\n", "atom": "\n <a title=\"Subscribe to samia/export/Citydata atom feed\" href=\"/samia/export/Citydata/feed/atom?api_key=e214ebea2335318bee1460a1fd33725ab3e1002e\"><i class=\"icon-rss-sign\" style=\"color: #fa9b39\"></i></a>\n", "owner": "owner", "rss": "\n <a title=\"Subscribe to samia/export/Citydata rss feed\" href=\"/samia/export/Citydata/feed/rss?api_key=e214ebea2335318bee1460a1fd33725ab3e1002e\"><i class=\"icon-rss-sign\" style=\"color: #fa9b39\"></i></a>\n", "name": "\n \n <div style=\"white-space: nowrap; }\">\n <a href=\"/samia/export/Citydata\">\n\n <span title=\"Mercurial repository\"><i class=\"icon-hg\" style=\"color: #316293; font-size: 14px;\"></i></span>\n\n <span style=\"margin: 0px 8px 0px 8px\"></span>\n Citydata\n </a>\n </div>\n", "last_rev_raw": 17, "state": "\n <div>\n <div class=\"btn btn-mini btn-success disabled\">Created</div>\n </div>\n", "menu": "\n <ul class=\"menu_items hidden\">\n\n <li style=\"border-top:1px solid #003367;margin-left:18px;padding-left:-99px\"></li>\n <li>\n <a title=\"Summary\" href=\"/samia/export/Citydata\">\n <span class=\"icon\">\n <i class=\"icon-file-text\"></i>\n </span>\n <span>Summary</span>\n </a>\n </li>\n <li>\n <a title=\"Changelog\" href=\"/samia/export/Citydata/changelog\">\n <span class=\"icon\">\n <i class=\"icon-list-alt\"></i>\n </span>\n <span>Changelog</span>\n </a>\n </li>\n <li>\n <a title=\"Files\" href=\"/samia/export/Citydata/files/tip/\">\n <span class=\"icon\">\n <i class=\"icon-file-alt\"></i>\n </span>\n <span>Files</span>\n </a>\n </li>\n <li>\n <a title=\"Fork\" href=\"/samia/export/Citydata/fork\">\n <span 
class=\"icon\">\n <i class=\"icon-code-fork\"></i>\n </span>\n <span>Fork</span>\n </a>\n </li>\n </ul>\n", "desc": "HDB Marktdatenimport", "last_change": "\n <span class=\"tooltip\" date=\"2014-08-21 18:49:50\" title=\"Thu, 21 Aug 2014 18:49:50\">6 days and 19 hours ago</span>\n"}, {"raw_name": "samia/export/CitydataDoc", "last_changeset": "\n <div>\n <pre><a title=\"xn01606 &lt;owner;gt;:\n\nChangedokumentation\" class=\"tooltip\" href=\"/samia/export/CitydataDoc/changeset/9ed1679c7a35b76e1402b540cee38000461fdfdd\">r0:9ed1679c7a35</a></pre>\n </div>\n", "atom": "\n <a title=\"Subscribe to samia/export/CitydataDoc atom feed\" href=\"/samia/export/CitydataDoc/feed/atom?api_key=e214ebea2335318bee1460a1fd33725ab3e1002e\"><i class=\"icon-rss-sign\" style=\"color: #fa9b39\"></i></a>\n", "owner": "xn00761 (Stefan Kortmann)", "rss": "\n <a title=\"Subscribe to samia/export/CitydataDoc rss feed\" href=\"/samia/export/CitydataDoc/feed/rss?api_key=e214ebea2335318bee1460a1fd33725ab3e1002e\"><i class=\"icon-rss-sign\" style=\"color: #fa9b39\"></i></a>\n", "name": "\n \n <div style=\"white-space: nowrap; }\">\n <a href=\"/samia/export/CitydataDoc\">\n\n <span title=\"Mercurial repository\"><i class=\"icon-hg\" style=\"color: #316293; font-size: 14px;\"></i></span>\n\n <span style=\"margin: 0px 8px 0px 8px\"></span>\n CitydataDoc\n </a>\n </div>\n", "last_rev_raw": 0, "state": "\n <div>\n <div class=\"btn btn-mini btn-success disabled\">Created</div>\n </div>\n", "menu": "\n <ul class=\"menu_items hidden\">\n\n <li style=\"border-top:1px solid #003367;margin-left:18px;padding-left:-99px\"></li>\n <li>\n <a title=\"Summary\" href=\"/samia/export/CitydataDoc\">\n <span class=\"icon\">\n <i class=\"icon-file-text\"></i>\n </span>\n <span>Summary</span>\n </a>\n </li>\n <li>\n <a title=\"Changelog\" href=\"/samia/export/CitydataDoc/changelog\">\n <span class=\"icon\">\n <i class=\"icon-list-alt\"></i>\n </span>\n <span>Changelog</span>\n </a>\n </li>\n <li>\n <a title=\"Files\" 
href=\"/samia/export/CitydataDoc/files/tip/\">\n <span class=\"icon\">\n <i class=\"icon-file-alt\"></i>\n </span>\n <span>Files</span>\n </a>\n </li>\n <li>\n <a title=\"Fork\" href=\"/samia/export/CitydataDoc/fork\">\n <span class=\"icon\">\n <i class=\"icon-code-fork\"></i>\n </span>\n <span>Fork</span>\n </a>\n </li>\n </ul>\n", "desc": "HDB Marktdatenimport (Dokumentation)", "last_change": "\n <span class=\"tooltip\" date=\"2014-04-25 11:03:45\" title=\"Fri, 25 Apr 2014 11:03:45\">4 months and 3 days ago</span>\n"}]};
var myDataSource = new YAHOO.util.DataSource(data);
myDataSource.responseType = YAHOO.util.DataSource.TYPE_JSON;
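Then, when I go through all these ...Doc links, I want to see whether there are more links with a specific tag name. Below you can see that I want to follow all links with the tag name "r0:9ed1679c7a35".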
<div class="box" style="margin-top: -20px">
<div class="title">
<div class="breadcrumbs">
<a href="/samia/Export/CitydataDoc/changelog">Latest changes</a>
</div>
</div>
<div class="table">
<div id="shortlog_data">
<table class="table_disp">
<tr>
<th class="left">Revision</th>
<th class="left">Commit message</th>
<th class="left">Age</th>
<th class="left">Author</th>
<th class="left">Refs</th>
</tr>
<tr class="parity0">
<td>
<div>
<div class="changeset-status-container">
</div>
<pre><a href="/samia/Export/CitydataDoc/files/9ed1679c7a35b76e1402b540cee38000461fdfdd/">r0:9ed1679c7a35</a></pre>
</div>
</td>
<td>
<a class="message-link" href="/samia/Export/CitydataDoc/changeset/9ed1679c7a35b76e1402b540cee38000461fdfdd">Changedokumentation</a>
</td>
<td><span class="tooltip" title="Fri, 25 Apr 2014 11:03:45">
4 months and 3 days ago</span>
</td>
<td title="owner;">owner</td>
<td>
<div class="tagtag" title="Tag tip">
<a href="/samia/Export/CitydataDoc/files/9ed1679c7a35b76e1402b540cee38000461fdfdd/">tip</a>
</div>
<div class="branchtag" title="Branch default">
<a href="/samia/Export/CitydataDoc/changelog?branch=default">default</a>
</div>
</td>
</tr>
When I am under a link named "r0:9ed1679c7a35", I get the following HTML code. Here I want to go to Changedokumentation.docx.
<div class="browser-body">
<table class="code-browser">
<thead>
<tr>
<th>Name</th>
<th>Size</th>
<th>Mimetype</th>
<th>Last Revision</th>
<th>Last modified</th>
<th>Last committer</th>
</tr>
</thead>
<tbody id="tbody">
<tr class="parity0">
<td>
<a class="browser-file ypjax-link" href="/samia/Export/CitydataDoc/files/9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation_P0702_HDB_20140318.docx">Changedokumentation.docx</a>
</td>
<td>
133.5 KiB
</td>
<td>
application/vnd.openxmlformats-officedocument.wordprocessingml.document
</td>
<td>
<div class="tooltip" title="Changedokumentation">
<pre>r0:9ed1679c7a35</pre>
</div>
</td>
<td>
<span class="tooltip" title="Fri, 25 Apr 2014 11:03:45">
4 months and 3 days ago</span>
</td>
<td>
<span title="owner">
owner
</span>
</td>
</tr>
</tbody>
<tbody id="tbody_filtered" style="display:none">
</tbody>
</table>
</div>
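Finally, when I reach the "end page", I first want the file name listed (so I know the file exists) and I also want the option to download the file. I can download it by pressing "Download as Raw". Here it is in the code.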
<div id="body" class="codeblock">
<div class="code-header">
<div class="stats">
<div class="left img"><i class="icon-file"></i></div>
<div class="left item"><pre class="tooltip" title="Fri, 25 Apr 2014 11:03:45"><a href="/samia/Export/CityDataDoc/changeset/9ed1679c7a35b76e1402b540cee38000461fdfdd">r0:9ed1679c7a35</a></pre></div>
<div class="left item"><pre>133.5 KiB</pre></div>
<div class="left item last"><pre>application/vnd.openxmlformats-officedocument.wordprocessingml.document</pre></div>
<div class="buttons">
<a class="btn btn-mini" href="/samia/Export/CityDataDoc/annotate/9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation.docx">Show Annotation</a>
<a class="btn btn-mini" href="/samia/Export/CityDataDoc/raw/9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation.docx">Show as Raw</a>
<a class="btn btn-mini" href="/samia/Export/CityDataDoc/rawfile/9ed1679c7a35b76e1402b540cee38000461fdfdd/Changedokumentation.docx">Download as Raw</a>
<a class="btn btn-mini disabled tooltip" href="#" title="Editing binary files not allowed">Edit</a>
<a class="btn btn-mini btn-danger" href="/samia/Export/CityDataDoc/delete/default/Changedokumentation.docx#edit">Delete</a>
</div>
</div>
<div class="author">
<div class="gravatar">
<img alt="gravatar" src="/images/user16.png"/>
</div>
<div title="owner" class="user">owner</div>
</div>
<div class="commit">Changedokumentation</div>
</div>
<div class="code-body">
<div style="padding:5px">
Binary file (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
</div>
</div>
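I know that is a lot of code. But it is easy to understand, I hope :-)
You use CSS-like selectors in Jsoup. The way you use it, it searches for an element with the tag name groups_list, which of course is not found. Instead, you need to look for the table with the ID groups_list. I think this should do it: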
doc.select("table#groups_list")
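Put this in the outer loop and it should at least print the td elements. Note that the header row contains only th cells, so tds is empty for that row and tds.get(0) would throw; skip rows without td cells.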
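As a rough illustration of the corrected loop, here is a minimal sketch that parses a trimmed copy of the table from the question and collects each link's text and absolute URL. The class name GroupLinks, the method extractGroupLinks, and the base URL http://example.invalid/ are made up for this example, and it assumes a reasonably recent Jsoup version that provides selectFirst.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class GroupLinks {

    // Hypothetical helper: maps link text -> absolute URL for each data row
    // of the table with id "groups_list".
    public static Map<String, String> extractGroupLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Map<String, String> links = new LinkedHashMap<>();
        for (Element row : doc.select("table#groups_list tr")) {
            // The header row only has <th> cells, so no link is found there.
            Element link = row.selectFirst("td a[href]");
            if (link != null) {
                // absUrl resolves the relative href against the base URI.
                links.put(link.text(), link.absUrl("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Trimmed copy of the question's table; the base URL is a placeholder.
        String html =
            "<table id=\"groups_list\">"
            + "<thead><tr><th class=\"left\"><a href=\"#\">Section Name</a></th></tr></thead>"
            + "<tr><td><div><a href=\"/samia/Export\"><i class=\"icon-folder-close\"></i> Export</a></div></td>"
            + "<td>samia/Export group</td></tr>"
            + "<tr><td><div><a href=\"/samia/Links\"><i class=\"icon-folder-close\"></i> Links</a></div></td>"
            + "<td>samia/Links group</td></tr>"
            + "</table>";
        extractGroupLinks(html, "http://example.invalid/")
            .forEach((name, url) -> System.out.println(name + " -> " + url));
    }
}
```

With a live page, the Document would come from Jsoup.connect(...).cookies(loginCookies).get() as in the question, and each collected URL could be passed to another Jsoup.connect call to descend into the subpage.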
Since an id should be unique within an HTML document, you can also do:
doc.select("#groups_list")
or
doc.select("table[id=groups_list]")
Finally, if you don't want to use Jsoup's CSS select engine, you can use a Jsoup method to access the element directly by its id:
doc.getElementById("groups_list");
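For the second step in the question (finding the repositories whose names end in Doc), note that Jsoup does not execute JavaScript, so the records in the inline script never become elements you can select. One possible approach, sketched below, is to take the script text (with Jsoup this would typically come from Element.data() on the script element) and pull the "raw_name" values out of it. The class DocRepoFinder and the method findDocRepos are hypothetical names, and the regular expression is a shortcut that assumes the names contain no escaped quotes; a real JSON parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DocRepoFinder {

    // Matches "raw_name": "<value>" and captures the value.
    // Assumption: the value contains no escaped double quotes.
    private static final Pattern RAW_NAME =
        Pattern.compile("\"raw_name\"\\s*:\\s*\"([^\"]+)\"");

    // Hypothetical helper: returns every raw_name in the script text that ends with "Doc".
    public static List<String> findDocRepos(String scriptText) {
        List<String> result = new ArrayList<>();
        Matcher matcher = RAW_NAME.matcher(scriptText);
        while (matcher.find()) {
            String rawName = matcher.group(1);
            if (rawName.endsWith("Doc")) {
                result.add(rawName);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Heavily shortened stand-in for the script shown in the question.
        String script = "var data = {\"totalRecords\": 2, \"records\": ["
            + "{\"raw_name\": \"samia/export/Citydata\", \"desc\": \"HDB Marktdatenimport\"}, "
            + "{\"raw_name\": \"samia/export/CitydataDoc\", \"desc\": \"HDB Marktdatenimport (Dokumentation)\"}]};";
        for (String name : findDocRepos(script)) {
            System.out.println(name);
        }
    }
}
```

Each name found this way is a path relative to the site root, so prefixing it with "/" and resolving it against the site's base URL gives the next page to fetch.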