
How to select sub elements with same tag in the same element jsoup?

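I need to parse a page with jsoup using element tags such as div, h3 and a. I want to go through each div.g element and get the text of the links with the following classes, a class="l _PMs" and a class="_pJs", so that it can be shown in a jList.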

Taking Google News as an example, the page looks like this:

<div class="g">
    <div class="ts _JGs _KHs _oGs _KGs _jHs">
        <a class="top _xGs _SHs" href="url" onmousedown="return rwt(this,'','','','1','dfda','','sdfa','','',event)">
            <img class="th _RGs" src="url" alt="Story image" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
        </a>
        <div class="_hJs">
            <h3 class="r _gJs">
                <a class="l _PMs" href="url" onmousedown="return rwt(this,'','','','1','dfs','','sdfa','','',event)">Report on <em>Example</em> Testing<em>Club</em> ...</a>
            </h3>
            <div class="slp">
                <span class="_OHs _PHs">link</span>
                <span class="_QGs">-</span>
                <span class="f nsa _QHs">date</span>
            </div>
            <div class="st">description</div>
        </div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','sdf','','sdfa','','',event)" data-href="url">Final review of <em>example's</em> of <em>testing</em>...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','dfa','','dfs-d','','',event)" data-href="url">Report on this testing
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_eJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','ad','','dfsaf','','',event)">Test report example
            </a>
        </div>
        <div class="_cJs"></div>
    </div>
</div>

<div class="g">
    <div class="ts _JGs _KHs _oGs _KGs _jHs">
        <a class="top _xGs _SHs" href="url" onmousedown="return rwt(this,'','','','1','dfda','','sdfa','','',event)">
            <img class="th _RGs" src="url" alt="Story image" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
        </a>
        <div class="_hJs">
            <h3 class="r _gJs">
                <a class="l _PMs" href="url" onmousedown="return rwt(this,'','','','1','dfs','','sdfa','','',event)">Cloud<em>Example</em> Testing<em>1</em> ...</a>
            </h3>
            <div class="slp">
                <span class="_OHs _PHs">link</span>
                <span class="_QGs">-</span>
                <span class="f nsa _QHs">date</span>
            </div>
            <div class="st">description</div>
        </div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','sdf','','sdfa','','',event)" data-href="url">Final review of this<em>testing</em>...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','dfa','','dfs-d','','',event)" data-href="url">Report on this...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_eJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','ad','','dfsaf','','',event)">Example 2...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="tsw _QMs">
            <div class="_jJs card-section">
                <a class="_MHs" href="url" target="_blank" onmousedown="return rwt(this,'','','','2','sdfs','','dfd','','',event)" data-href="url">
                    <img class="_iJs" id="news-media-image-52779751835836-0" src="url" alt="image1" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
                    <div class="_RMs">USA TODAY.</div>
                </a>
                <a class="_MHs" href="url" target="_blank" onmousedown="return rwt(this,'','','','2','sdfsa','','dsfa','','',event)">
                    <img class="_iJs" id="news-media-image-52779751835836-1" src="url" alt="image2" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
                    <div class="_RMs">image2.</div>
                </a>
            </div>
            <div class="_NMs">
                <a class="_OMs" href="url">View all
                </a>
            </div>
        </div>
    </div>
</div>

Here is the code:

String input = txtSearch.getText();
input = input.replace(" ", "+");
String url = "http://www.google.com/search?q=" + input + "&tbm=nws&source=lnms";
try {
    Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
    Elements e = doc.select("div.g");
    DefaultListModel<String> listModel = new DefaultListModel<>();
    e.forEach((e1) -> {
        e1.getElementsByTag("a").forEach(linkElement -> listModel.addElement(linkElement.text()));
    });
    newsList.setModel(listModel);            
} catch (IOException ex) {
    Logger.getLogger(MainUI.class.getName()).log(Level.SEVERE, null, ex);
}

The actual output shown in the jList is:

Report on Example Testing Club...  
Final review of example's of testing...  
Report on this testing.  
Test report example.
Cloud Example Testing 1.   
Final review of this testing.   
Report on this...   
Example 2...   
USA TODAY.   
image2.   
View all

How can I select only the links with a class="l _PMs" and a class="_pJs", without a class=_MHs and a class=_OMs, so that the jList shows the following:

Report on Example Testing Club...  
Final review of example's of testing...  
Report on this testing.  
Test report example.
Cloud Example Testing 1.   
Final review of this testing.   
Report on this...   
Example 2...

Just change this line:

Elements e = doc.select("div.g");

to:

Elements e = doc.select("div.g").select("a");

Then, in the loop, just read the text of each element, for example:

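for (Element element : e) {
    // Use the loop variable, not the whole Elements collection,
    // so that each link contributes its own text
    yourList.add(element.text());
}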

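With Elements e = doc.select("div.g").select("a"); we get a list of every a element inside the div.g elements. We can then iterate over each a element with a for loop and read its text, or even its attributes.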
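As a rough illustration of reading both the text and an attribute in that loop, here is a minimal sketch. The class name LinkLister, the helper describeLinks, and the inline HTML are made up for this example; only Jsoup.parse, select, text and attr are jsoup calls. Note that this selector still returns every a element inside div.g, whatever its class.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class LinkLister {
    // Hypothetical helper: returns "text -> href" for every a element inside div.g.
    static List<String> describeLinks(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("div.g").select("a");
        List<String> out = new ArrayList<>();
        for (Element link : links) {
            // text() is the visible text of this one link, attr() reads an attribute
            out.add(link.text() + " -> " + link.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample markup; the last link sits outside div.g and is skipped
        String html = "<div class=\"g\">"
                + "<h3><a class=\"l _PMs\" href=\"/one\">First story</a></h3>"
                + "<a class=\"_pJs\" href=\"/two\">Second story</a>"
                + "</div>"
                + "<a href=\"/outside\">Outside</a>";
        describeLinks(html).forEach(System.out::println);
    }
}
```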

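The problem is that you select all a elements inside a given div and then call the .text() method on that whole list of elements. Naturally, it returns the concatenated text of all the a elements.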

To make the code work as expected, you can change:

e.forEach((e1) -> {
    listModel.addElement(e1.getElementsByTag("a").text());
});

to:

e.forEach((e1) -> {
    e1.getElementsByTag("a").forEach(linkElement -> listModel.addElement(linkElement.text()));
});
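The difference between the two versions can be checked on a small inline document. This is only a sketch: the class TextJoinDemo, its two helper methods, and the sample HTML are assumptions made for the demonstration, not part of the original code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class TextJoinDemo {
    // Calling text() on the Elements collection gives one combined string.
    static String combined(String html) {
        Elements links = Jsoup.parse(html).getElementsByTag("a");
        return links.text();
    }

    // Calling text() on each Element gives one string per link.
    static List<String> separate(String html) {
        List<String> out = new ArrayList<>();
        for (Element link : Jsoup.parse(html).getElementsByTag("a")) {
            out.add(link.text());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div class=\"g\"><a>One</a><a>Two</a></div>";
        System.out.println(combined(html)); // a single list entry
        System.out.println(separate(html)); // one list entry per link
    }
}
```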

Update

If you want to select only the a elements that have the classes l and _PMs, or the class _pJs, you can rewrite the code like this:

Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
DefaultListModel<String> listModel = new DefaultListModel<>();
doc.select("div.g a.l._PMs, div.g a._pJs")
        .forEach(element -> listModel.addElement(element.text()));
newsList.setModel(listModel);            

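The selector is: div.g a.l._PMs, div.g a._pJs, which means: select all elements that satisfy one of the following conditions:

  • they are a elements with both the l and _PMs classes, inside a div element with the g class
  • they are a elements with the _pJs class, inside a div element with the g class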
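The two conditions can be tried out without a network request by parsing a trimmed-down copy of the markup. The sketch below is an illustration only: the class HeadlineSelector, the helper headlines, and the shortened HTML are assumptions; the selector string is the one from the answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class HeadlineSelector {
    // Hypothetical helper: collects the text of the links matched by the selector group.
    static List<String> headlines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        // The comma separates two alternatives; an element matching either one is kept
        for (Element link : doc.select("div.g a.l._PMs, div.g a._pJs")) {
            out.add(link.text());
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened, made-up version of one div.g block from the question
        String html = "<div class=\"g\">"
                + "<h3 class=\"r _gJs\"><a class=\"l _PMs\" href=\"url\">Headline</a></h3>"
                + "<div class=\"_sJs card-section\"><a class=\"_pJs\" href=\"url\">Related</a></div>"
                + "<div class=\"tsw _QMs\"><a class=\"_MHs\" href=\"url\">Source</a></div>"
                + "<div class=\"_NMs\"><a class=\"_OMs\" href=\"url\">View all</a></div>"
                + "</div>";
        // The _MHs and _OMs links are not matched, so they do not appear
        headlines(html).forEach(System.out::println);
    }
}
```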

