Extracting text from an element in an HTML page using Jsoup

Question

I am extracting text from the following HTML element:

<span class="adr" style="float: none !important;">
     <span class="street-address" style="float: none !important;">18, Jawaharlal Nehru
       Road,
     </span>
     <span  style="float: none !important;" class="estb_addr-HeadingTxt">
       <a style="float: none !important;"   href="http://kolkata.burrp.com/area/park-street" class="locality">&nbsp;Park Street</a></span>
       ,&nbsp;Kolkata<span class="region" style="display: none;">Kolkata
     </span>
</span>

For that I wrote the following piece of code:

for (Element element : doc.getAllElements()) {
    for (Element childelem : element.children()) {
        // print only the text that belongs directly to this child element
        if (childelem.hasText() && !childelem.ownText().isEmpty()) {
            String currText = childelem.ownText();
            System.out.print(currText + " ");
        }
    }
    System.out.println("");
}

Ideally the output should be: 18, Jawaharlal Nehru Road, Park Street, Kolkata. Instead the code prints "18, Jawaharlal Nehru Road, Kolkata" and "Park Street" separately. I understand that the desired output is essentially an in-order traversal of the DOM tree rooted at the outer <span>, but I don't know how to achieve that with Jsoup when the DOM tree of an element in an HTML page can have arbitrary levels of nesting.

Any help would be appreciated. Thank you.

Answer 1

Use either DOM navigation or CSS-selector syntax to pick the element you want; do not loop through all Elements.

Element adr = doc.select("span.adr").first();
System.out.println(adr.text());
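For completeness, here is a minimal self-contained sketch of the same idea applied to the markup from the question. The class name AddressText, the helper extractAddress and the SAMPLE_HTML constant are made up for illustration, and the inline style attributes are omitted from the sample for brevity. Element.text() collects the text of the element and all of its descendants in document order, so the nesting depth does not matter. One assumption beyond the answer above: Jsoup does not evaluate CSS, so the text of the hidden span.region would be included too, and the sketch removes that span first to avoid a duplicated "Kolkata". Whether you want that depends on your page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AddressText {

    // Markup from the question, with the style attributes left out for brevity.
    static final String SAMPLE_HTML =
          "<span class=\"adr\">"
        + "<span class=\"street-address\">18, Jawaharlal Nehru\n  Road,\n</span>"
        + "<span class=\"estb_addr-HeadingTxt\">"
        + "<a href=\"http://kolkata.burrp.com/area/park-street\" class=\"locality\">&nbsp;Park Street</a></span>"
        + ",&nbsp;Kolkata<span class=\"region\" style=\"display: none;\">Kolkata\n</span>"
        + "</span>";

    // Hypothetical helper: returns the combined text of span.adr, or "" if it is absent.
    static String extractAddress(String html) {
        Document doc = Jsoup.parse(html);
        Element adr = doc.select("span.adr").first();
        if (adr == null) {
            return "";
        }
        // Assumption: span.region only repeats the city and is hidden via CSS,
        // which Jsoup ignores, so drop it before reading the text.
        adr.select("span.region").remove();
        // text() walks the whole subtree in document order and normalises whitespace.
        return adr.text();
    }

    public static void main(String[] args) {
        System.out.println(extractAddress(SAMPLE_HTML));
    }
}
```

If you need the hidden region text as well, skip the remove() call; adr.text() will then also contain the second "Kolkata".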

Answer 1: 0, 2013-02-27 03:12:40
