Select and iterate through elements and sub elements with same name (Jsoup)
How to select sub elements with same tag in the same element jsoup?
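I need to parse a page with jsoup using element tags such as div, h3, and a. I want to go through the div.g elements and get the text of the links with a class="l _PMs" and a class="_pJs", then show it in a jList.
Taking Google News as an example, the page looks like this: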
<div class="g">
<div class="ts _JGs _KHs _oGs _KGs _jHs">
<a class="top _xGs _SHs" href="url" onmousedown="return rwt(this,'','','','1','dfda','','sdfa','','',event)">
<img class="th _RGs" src="url" alt="Story image" onload="typeof google==='object'&&google.aft&&google.aft(this)">
</a>
<div class="_hJs">
<h3 class="r _gJs">
<a class="l _PMs" href="url" onmousedown="return rwt(this,'','','','1','dfs','','sdfa','','',event)">Report on <em>Example</em> Testing<em>Club</em> ...</a>
</h3>
<div class="slp">
<span class="_OHs _PHs">link</span>
<span class="_QGs">-</span>
<span class="f nsa _QHs">date</span>
</div>
<div class="st">description</div>
</div>
<div class="_sJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','sdf','','sdfa','','',event)" data-href="url">Final review of <em>example's</em> of <em>testing</em>...
</a>
</div>
<div class="_cJs"></div>
<div class="_sJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','dfa','','dfs-d','','',event)" data-href="url">Report on this testing
</a>
</div>
<div class="_cJs"></div>
<div class="_eJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','ad','','dfsaf','','',event)">Test report example
</a>
</div>
<div class="_cJs"></div>
</div>
</div>
<div class="g">
<div class="ts _JGs _KHs _oGs _KGs _jHs">
<a class="top _xGs _SHs" href="url" onmousedown="return rwt(this,'','','','1','dfda','','sdfa','','',event)">
<img class="th _RGs" src="url" alt="Story image" onload="typeof google==='object'&&google.aft&&google.aft(this)">
</a>
<div class="_hJs">
<h3 class="r _gJs">
<a class="l _PMs" href="url" onmousedown="return rwt(this,'','','','1','dfs','','sdfa','','',event)">Cloud<em>Example</em> Testing<em>1</em> ...</a>
</h3>
<div class="slp">
<span class="_OHs _PHs">link</span>
<span class="_QGs">-</span>
<span class="f nsa _QHs">date</span>
</div>
<div class="st">description</div>
</div>
<div class="_sJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','sdf','','sdfa','','',event)" data-href="url">Final review of this<em>testing</em>...
</a>
</div>
<div class="_cJs"></div>
<div class="_sJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','dfa','','dfs-d','','',event)" data-href="url">Report on this...
</a>
</div>
<div class="_cJs"></div>
<div class="_eJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','ad','','dfsaf','','',event)">Example 2...
</a>
</div>
<div class="_cJs"></div>
<div class="tsw _QMs">
<div class="_jJs card-section">
<a class="_MHs" href="url" target="_blank" onmousedown="return rwt(this,'','','','2','sdfs','','dfd','','',event)" data-href="url">
<img class="_iJs" id="news-media-image-52779751835836-0" src="url" alt="image1" onload="typeof google==='object'&&google.aft&&google.aft(this)">
<div class="_RMs">USA TODAY.</div>
</a>
<a class="_MHs" href="url" target="_blank" onmousedown="return rwt(this,'','','','2','sdfsa','','dsfa','','',event)">
<img class="_iJs" id="news-media-image-52779751835836-1" src="url" alt="image2" onload="typeof google==='object'&&google.aft&&google.aft(this)">
<div class="_RMs">image2.</div>
</a>
</div>
<div class="_NMs">
<a class="_OMs" href="url">View all
</a>
</div>
</div>
</div>
</div>
Here is the code:
String input = txtSearch.getText();
input = input.replace(" ", "+");
String url = "http://www.google.com/search?q=" + input + "&tbm=nws&source=lnms";
try {
    Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
    Elements e = doc.select("div.g");
    DefaultListModel<String> listModel = new DefaultListModel<>();
    e.forEach((e1) -> {
        e1.getElementsByTag("a").forEach(linkElement -> listModel.addElement(linkElement.text()));
    });
    newsList.setModel(listModel);
} catch (IOException ex) {
    Logger.getLogger(MainUI.class.getName()).log(Level.SEVERE, null, ex);
}
The actual output shown in the jList is:
Report on Example Testing Club...
Final review of example's of testing...
Report on this testing.
Test report example.
Cloud Example Testing 1.
Final review of this testing.
Report on this...
Example 2...
USA TODAY.
image2.
View all
How can I select only the a class="l _PMs" and a class="_pJs" elements, without a class=_MHs and a class=_OMs, so that the jList shows the following:
Report on Example Testing Club...
Final review of example's of testing...
Report on this testing.
Test report example.
Cloud Example Testing 1.
Final review of this testing.
Report on this...
Example 2...
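Just change this line:
Elements e = doc.select("div.g");
to:
Elements e = doc.select("div.g").select("a");
Then, in the loop, read the text of each element, for example:
for (Element element : e) {
    yourList.add(element.text());
}
With Elements e = doc.select("div.g").select("a"); we get a list of every a element inside the div.g elements. We can then iterate over each of them in a for loop and read its text or even its attributes.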
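The problem is that you select all the a elements inside a given div and call .text() on that whole list of elements, so it naturally returns the concatenated text of all the a elements.
To make the code work as expected, you can change: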
e.forEach((e1) -> {
    listModel.addElement(e1.getElementsByTag("a").text());
});
to:
e.forEach((e1) -> {
    e1.getElementsByTag("a").forEach(linkElement -> listModel.addElement(linkElement.text()));
});
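The difference between the two versions can be reproduced on a small in-memory fragment. This is only a minimal sketch, assuming jsoup is on the classpath; the class name TextJoinDemo and the HTML string are hypothetical and far smaller than the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class TextJoinDemo {
    // Hypothetical fragment (assumption): one div.g holding two links.
    static final String HTML =
        "<div class=\"g\">"
        + "<a class=\"l _PMs\" href=\"url\">First headline</a>"
        + "<a class=\"_pJs\" href=\"url\">Related story</a>"
        + "</div>";

    // Elements.text() combines the text of every matched element into one string.
    static String combinedText() {
        Document doc = Jsoup.parse(HTML);
        Elements links = doc.select("div.g").first().getElementsByTag("a");
        return links.text();
    }

    // Calling Element.text() on each link yields one entry per link.
    static List<String> textPerLink() {
        Document doc = Jsoup.parse(HTML);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("div.g").first().getElementsByTag("a")) {
            result.add(link.text());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(combinedText());
        System.out.println(textPerLink());
    }
}
```

The first method corresponds to the original code, which adds a single list entry per div.g; the second corresponds to the corrected loop, which adds one entry per link.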
Update
If you only want to select the a elements with the l + _PMs classes or the _pJs class, you can rewrite the code like this:
Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
DefaultListModel<String> listModel = new DefaultListModel<>();
doc.select("div.g a.l._PMs, div.g a._pJs")
.forEach(element -> listModel.addElement(element.text()));
newsList.setModel(listModel);
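The selector is: div.g a.l._PMs, div.g a._pJs
which means: select all elements that satisfy either of the following conditions:
a elements that have both the l and _PMs classes and are inside a div element with the g class
a elements that have the _pJs class and are inside a div element with the g class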
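To check the grouped selector without hitting the network, it can be run against a parsed string. The following is a sketch under stated assumptions: jsoup is on the classpath, and the class name HeadlineSelectorDemo, the helper headlines, and the trimmed-down HTML are hypothetical illustrations, not part of the original code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class HeadlineSelectorDemo {
    // Hypothetical, trimmed-down version of one result block from the question.
    static final String HTML =
        "<div class=\"g\">"
        + "<h3 class=\"r _gJs\"><a class=\"l _PMs\" href=\"url\">Report on Example Testing Club ...</a></h3>"
        + "<div class=\"_sJs card-section\"><a class=\"_pJs\" href=\"url\">Report on this testing</a></div>"
        + "<div class=\"tsw _QMs\"><a class=\"_MHs\" href=\"url\">USA TODAY.</a></div>"
        + "<div class=\"_NMs\"><a class=\"_OMs\" href=\"url\">View all</a></div>"
        + "</div>";

    // Collects the text of links matching either branch of the grouped selector.
    static List<String> headlines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element link : doc.select("div.g a.l._PMs, div.g a._pJs")) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // The a._MHs and a._OMs links are not matched by either branch.
        headlines(HTML).forEach(System.out::println);
    }
}
```

The comma acts as an "or" between the two branches, so links with other classes such as _MHs and _OMs are left out even though they sit inside the same div.g.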