How to select sub elements with same tag in the same element jsoup?

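I need to parse a page with jsoup using element tags such as div, h3, and a. I want to go through each div.g element and get the text of the links with the following classes, a class="l _PMs" and a class="_pJs", to show in a jList.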

Using Google News as an example, the page looks like this:

<div class="g">
    <div class="ts _JGs _KHs _oGs _KGs _jHs">
        <a class="top _xGs _SHs" href="url" onmousedown="return rwt(this,'','','','1','dfda','','sdfa','','',event)">
            <img class="th _RGs" src="url" alt="Story image" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
        </a>
        <div class="_hJs">
            <h3 class="r _gJs">
                <a class="l _PMs" href="url" onmousedown="return rwt(this,'','','','1','dfs','','sdfa','','',event)">Report on <em>Example</em> Testing<em>Club</em> ...</a>
            </h3>
            <div class="slp">
                <span class="_OHs _PHs">link</span>
                <span class="_QGs">-</span>
                <span class="f nsa _QHs">date</span>
            </div>
            <div class="st">description</div>
        </div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','sdf','','sdfa','','',event)" data-href="url">Final review of <em>example's</em> of <em>testing</em>...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','dfa','','dfs-d','','',event)" data-href="url">Report on this testing
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_eJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','ad','','dfsaf','','',event)">Test report example
            </a>
        </div>
        <div class="_cJs"></div>
    </div>
</div>

<div class="g">
    <div class="ts _JGs _KHs _oGs _KGs _jHs">
        <a class="top _xGs _SHs" href="url" onmousedown="return rwt(this,'','','','1','dfda','','sdfa','','',event)">
            <img class="th _RGs" src="url" alt="Story image" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
        </a>
        <div class="_hJs">
            <h3 class="r _gJs">
                <a class="l _PMs" href="url" onmousedown="return rwt(this,'','','','1','dfs','','sdfa','','',event)">Cloud<em>Example</em> Testing<em>1</em> ...</a>
            </h3>
            <div class="slp">
                <span class="_OHs _PHs">link</span>
                <span class="_QGs">-</span>
                <span class="f nsa _QHs">date</span>
            </div>
            <div class="st">description</div>
        </div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','sdf','','sdfa','','',event)" data-href="url">Final review of this<em>testing</em>...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','dfa','','dfs-d','','',event)" data-href="url">Report on this...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_eJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','ad','','dfsaf','','',event)">Example 2...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="tsw _QMs">
            <div class="_jJs card-section">
                <a class="_MHs" href="url" target="_blank" onmousedown="return rwt(this,'','','','2','sdfs','','dfd','','',event)" data-href="url">
                    <img class="_iJs" id="news-media-image-52779751835836-0" src="url" alt="image1" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
                    <div class="_RMs">USA TODAY.</div>
                </a>
                <a class="_MHs" href="url" target="_blank" onmousedown="return rwt(this,'','','','2','sdfsa','','dsfa','','',event)">
                    <img class="_iJs" id="news-media-image-52779751835836-1" src="url" alt="image2" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
                    <div class="_RMs">image2.</div>
                </a>
            </div>
            <div class="_NMs">
                <a class="_OMs" href="url">View all
                </a>
            </div>
        </div>
    </div>
</div>

Here is the code:

String input = txtSearch.getText();
input = input.replace(" ", "+");
String url = "http://www.google.com/search?q=" + input + "&tbm=nws&source=lnms";
try {
    Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
    Elements e = doc.select("div.g");
    DefaultListModel<String> listModel = new DefaultListModel<>();
    e.forEach((e1) -> {
        e1.getElementsByTag("a").forEach(linkElement -> listModel.addElement(linkElement.text()));
    });
    newsList.setModel(listModel);            
} catch (IOException ex) {
    Logger.getLogger(MainUI.class.getName()).log(Level.SEVERE, null, ex);
}

The actual output shown in the jList is:

Report on Example Testing Club...  
Final review of example's of testing...  
Report on this testing.  
Test report example.
Cloud Example Testing 1.   
Final review of this testing.   
Report on this...   
Example 2...   
USA TODAY.   
image2.   
View all

How can I select only the links with a class="l _PMs" and a class="_pJs", and not those with a class=_MHs or a class=_OMs, so that the jList shows the following:

Report on Example Testing Club...  
Final review of example's of testing...  
Report on this testing.  
Test report example.
Cloud Example Testing 1.   
Final review of this testing.   
Report on this...   
Example 2...

Just change this line:

Elements e = doc.select("div.g");

to:

Elements e = doc.select("div.g").select("a");

Then, in the loop, just read the text of each element, for example:

    // Use the loop variable, not the whole Elements list
    for (Element element : e) {
        yourList.add(element.text());
    }

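With Elements e = doc.select("div.g").select("a"); we get a list of every a element inside the div.g elements. We can then iterate over each a element in a for loop and read its text or even its attributes.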
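
As a rough illustration of the loop described above, here is a minimal sketch that parses a small inline HTML string instead of fetching a live page. The class name LinkLister, the helper describeLinks, and the sample markup are hypothetical and only serve the example; the jsoup calls used (Jsoup.parse, select, text, attr) are the standard ones.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class LinkLister {
    // Hypothetical helper: returns "text -> href" for every <a> inside a div.g.
    static List<String> describeLinks(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("div.g").select("a");
        List<String> result = new ArrayList<>();
        for (Element link : links) {
            // text() is this single element's text; attr() reads one attribute.
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample markup, loosely modelled on the page in the question.
        String html = "<div class=\"g\">"
                + "<a class=\"l _PMs\" href=\"url1\">Report on <em>Example</em></a>"
                + "<a class=\"_pJs\" href=\"url2\">Final review</a>"
                + "</div>"
                + "<a href=\"url3\">Outside</a>";
        describeLinks(html).forEach(System.out::println);
    }
}
```

Note that this still returns every a element inside div.g, including links with other classes; it does not filter by class.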

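The problem is that you select all the a elements inside a given div and call the .text() method on that whole list of elements, which naturally returns the concatenated text of all the a elements.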

To make the code work as expected, you can change:

e.forEach((e1) -> {
    listModel.addElement(e1.getElementsByTag("a").text());
});

to:

e.forEach((e1) -> {
    e1.getElementsByTag("a").forEach(linkElement -> listModel.addElement(linkElement.text()));
});
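
To make the difference concrete, here is a small self-contained sketch that contrasts the two calls on an inline HTML string. The class name TextDemo, its two methods, and the sample markup are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class TextDemo {
    // Made-up sample markup: one div.g containing two links.
    static final String HTML = "<div class=\"g\"><a>First</a><a>Second</a></div>";

    // Elements.text() joins the text of all matched elements into one string.
    static String combined() {
        Elements links = Jsoup.parse(HTML).select("div.g a");
        return links.text();
    }

    // Calling Element.text() per element gives one string per link.
    static List<String> separate() {
        List<String> texts = new ArrayList<>();
        for (Element link : Jsoup.parse(HTML).select("div.g a")) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(combined());  // prints "First Second"
        System.out.println(separate());  // prints "[First, Second]"
    }
}
```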

Update

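If you only want to select the a elements that have either the classes l and _PMs or the class _pJs, you can rewrite the code like this: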

Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
DefaultListModel<String> listModel = new DefaultListModel<>();
doc.select("div.g a.l._PMs, div.g a._pJs")
        .forEach(element -> listModel.addElement(element.text()));
newsList.setModel(listModel);            

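The selector is div.g a.l._PMs, div.g a._pJs, which means: select all elements that satisfy one of the following conditions:

  • They are a elements with the classes l and _PMs that are inside a div element with the class g
  • They are a elements with the class _pJs that are inside a div element with the class g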
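
The two conditions above can be checked offline with a short sketch. The class name SelectorDemo, the helper headlineTexts, and the trimmed-down sample markup are assumptions made for this example; only the selector string comes from the answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {
    // Hypothetical helper: collects the text of links matching either branch
    // of the comma-separated selector.
    static List<String> headlineTexts(String html) {
        List<String> texts = new ArrayList<>();
        for (Element link : Jsoup.parse(html).select("div.g a.l._PMs, div.g a._pJs")) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up markup: two wanted links, then two that should be skipped.
        String html = "<div class=\"g\">"
                + "<h3><a class=\"l _PMs\" href=\"url\">Headline</a></h3>"
                + "<div><a class=\"_pJs\" href=\"url\">Related</a></div>"
                + "<div><a class=\"_MHs\" href=\"url\">Source</a></div>"
                + "<div><a class=\"_OMs\" href=\"url\">View all</a></div>"
                + "</div>";
        System.out.println(headlineTexts(html));  // prints "[Headline, Related]"
    }
}
```

Because the a._MHs and a._OMs links match neither branch of the selector, they are left out of the result.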

