Extracting text from an element in an HTML page using Jsoup
I am extracting text from the following HTML element:
<span class="adr" style="float: none !important;">
<span class="street-address" style="float: none !important;">18, Jawaharlal Nehru
Road,
</span>
<span style="float: none !important;" class="estb_addr-HeadingTxt">
<a style="float: none !important;" href="http://kolkata.burrp.com/area/park-street" class="locality"> Park Street</a></span>
, Kolkata<span class="region" style="display: none;">Kolkata
</span>
</span>
For that I wrote the following piece of code:
for (Element element : doc.getAllElements())
{
    for (Element childelem : element.children())
    {
        if (childelem.hasText() && !childelem.ownText().isEmpty())
        {
            String currText = childelem.ownText();
            System.out.print(currText + " ");
        }
    }
    System.out.println("");
}
Ideally the output should be 18, Jawaharlal Nehru Road, Park Street, Kolkata. But it is giving
18, Jawaharlal Nehru Road, Kolkata and
Park Street.
I can understand that the desired output is basically an in-order traversal of the DOM tree rooted at the outer <span>. But I don't know exactly how to achieve that with Jsoup, where the DOM tree for an element in an HTML page can have arbitrary levels of nesting.
Any help would be appreciated. Thank you.
Use either DOM navigation or CSS-selector syntax to do the task; do not loop through all Elements.
// text() returns the combined, whitespace-normalized text of the element and all its descendants
Element adr = doc.select("span.adr").first();
System.out.println(adr.text());
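As a rough illustration of why calling text() on the outer span gives one combined string, here is a sketch of the kind of document-order walk involved. Note that it deliberately does not use Jsoup: it parses a simplified copy of the snippet with the JDK's built-in XML parser, which only works because this snippet happens to be well-formed XML. The class and method names (DocumentOrderText, collectText, visibleText) are made up for this example. It also skips elements with an inline display: none style, which is an assumption about what is wanted here; Jsoup does not evaluate CSS, so the hidden region span's text is likely to show up in adr.text() unless it is removed first.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of Jsoup.
class DocumentOrderText {

    // Appends the value of every text node under `node`, in document order.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // getAttribute returns "" when the attribute is absent.
            String style = ((Element) node).getAttribute("style");
            // Assumption: elements hidden via inline display:none are skipped.
            if (style.replace(" ", "").contains("display:none")) {
                return;
            }
        }
        // Recursing child by child handles arbitrary nesting depth.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Parses well-formed markup and returns its visible text on one line.
    static String visibleText(String markup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), out);
        // Collapse runs of whitespace into single spaces and trim the ends.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        // Simplified copy of the markup from the question (float styles dropped).
        String markup = "<span class=\"adr\">"
                + "<span class=\"street-address\">18, Jawaharlal Nehru\nRoad,\n</span>\n"
                + "<span class=\"estb_addr-HeadingTxt\">\n"
                + "<a href=\"http://kolkata.burrp.com/area/park-street\" class=\"locality\"> Park Street</a></span>\n"
                + ", Kolkata<span class=\"region\" style=\"display: none;\">Kolkata\n</span>\n"
                + "</span>";
        System.out.println(visibleText(markup));
    }
}
```

With the sample markup this prints the address on a single line, with "Kolkata" appearing once. A space remains before the comma after "Park Street", because the source has a line break between the closing tag and the comma. With Jsoup itself, the two-line snippet above stays the simpler route.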