
How to select leaf tags of an HTML document using jsoup

I am using jsoup to parse an HTML document. I need to extract all the innermost div elements, that is, div tags that contain no nested div tags. I used the following in Java to extract them:

Elements bodyTag = document.select("div:not(div>div)"); 

Here is an example:

<div id="header">
     <div class="container">
         <div id="header-logo"> 
         <a href="/" title="mekay.com">
             <div id="logo">
             </div> </a>
        </div>
        <div id="header-banner">
            <div data-type="ad" data-publisher="lqm.j2ee.site" data-zone="ron">
            </div>
        </div>
     </div>
</div>

I need to extract only the following:

 <div id="logo">
 </div>
 <div data-type="ad" data-publisher="lqm.j2ee.site" data-zone="ron">
 </div>

Instead, the above code snippet returns all the div tags. Could you please help me figure out what is wrong with this selector?

This one works perfectly:

Elements innerMostDivs = doc.select("div:not(:has(div))");

Try it online:

  • add your HTML file
  • add the CSS query div:not(:has(div))
  • check the resulting elements
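To check the same thing locally instead of online, here is a minimal sketch that runs the selector against the markup from the question. It assumes jsoup is on the classpath; the class name LeafDivs, the HTML constant, and the leafDivs helper are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class LeafDivs {
    // Sample markup from the question (hypothetical constant name).
    static final String HTML = String.join("\n",
        "<div id=\"header\">",
        "  <div class=\"container\">",
        "    <div id=\"header-logo\">",
        "      <a href=\"/\" title=\"mekay.com\"><div id=\"logo\"></div></a>",
        "    </div>",
        "    <div id=\"header-banner\">",
        "      <div data-type=\"ad\" data-publisher=\"lqm.j2ee.site\" data-zone=\"ron\"></div>",
        "    </div>",
        "  </div>",
        "</div>");

    // Selects every div that has no div anywhere among its descendants.
    static Elements leafDivs(Document doc) {
        return doc.select("div:not(:has(div))");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        for (Element div : leafDivs(doc)) {
            // Print the attributes so the matched divs can be told apart.
            System.out.println("div" + div.attributes());
        }
    }
}
```

With this input, only the logo div and the ad div should be listed, since every other div contains at least one nested div.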

If you want only the leaf divs that have no children at all, use this:

Elements emptyDivs = document.select("div:empty");

The selector you are using now means "fetch me all the divs that are not direct children of another div". It is normal that it returns the very first parent div, because the div id="header" is not a direct child of a div. Most likely its parent is body.
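To see what "not a direct child of another div" means for the sample markup, the sketch below does the parent check by hand instead of relying on the selector. It assumes jsoup is on the classpath; DivParents, hasDivParent, and divsWithoutDivParent are hypothetical names for this illustration, and it makes no claim about what div:not(div>div) returns in any particular jsoup version.

```java
import java.util.List;
import java.util.stream.Collectors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DivParents {
    // Sample markup from the question (hypothetical constant name).
    static final String HTML = String.join("\n",
        "<div id=\"header\">",
        "  <div class=\"container\">",
        "    <div id=\"header-logo\">",
        "      <a href=\"/\" title=\"mekay.com\"><div id=\"logo\"></div></a>",
        "    </div>",
        "    <div id=\"header-banner\">",
        "      <div data-type=\"ad\" data-publisher=\"lqm.j2ee.site\" data-zone=\"ron\"></div>",
        "    </div>",
        "  </div>",
        "</div>");

    // True when the element's direct parent is itself a div.
    static boolean hasDivParent(Element el) {
        Element parent = el.parent();
        return parent != null && parent.tagName().equals("div");
    }

    // Manual filter: all divs whose direct parent is not a div.
    static List<Element> divsWithoutDivParent(Document doc) {
        return doc.select("div").stream()
                .filter(el -> !hasDivParent(el))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        for (Element div : doc.select("div")) {
            Element parent = div.parent();
            String parentTag = (parent == null) ? "(none)" : parent.tagName();
            // Show each div next to the tag name of its direct parent.
            System.out.println("div" + div.attributes() + " <- parent: " + parentTag);
        }
    }
}
```

In this sample the header div sits directly under body, and the logo div is wrapped in an a element, so neither has a div as its direct parent. A parent-based condition therefore says nothing about whether a div is a leaf, which is why the :has(div) form is the one that fits the question.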
