
Java jsoup select contents

I have an HTML file that contains many code blocks like the following:

<div class="f-icon m-item " data-ctrdot="60055294621"> 
 <div class="item-main util-clearfix"> 
  <div class="content"> 
   <div class="cwrap"> 
    <div class="cleft"> 
     <div class="lwrap"> 
      <h2 class="title"><a href="http://www.alibaba.com/product-detail/Sunnytex-Best-Selling-wind-proof-Soft_60055294621.html?s=p" title="Sunnytex Best Selling wind proof Soft Shell Winter Black Wool Coat" data-hislog="60055294621" data-pid="60055294621" data-domdot="id:2678,pid:60055294621,ext:'|n=2|s=p|t={{attr target}}'" target="_blank" data-p4plog="60055294621">Sunnytex Best Selling wind proof Soft Shell Winter Black Wool Coat</a> </h2> 
      <div class="attr">
        US $23.5-24.8 / 
       <em>Piece</em> 
       <em>( FOB Price)</em> 
      </div> 
      <div class="attr">
        500 Pieces 
       <em>(Min. Order)</em> 
      </div> 
      <div class="kv-prop util-clearfix"> 
       <div class="kv" title="Product Type: Coats">
        Product Type: 
        <b>Coats</b>
       </div> 
       <div class="kv" title="Age Group: Adults">
        Age Group: 
        <b>Adults</b>
       </div> 
       .... (many other stuff not shown here)
       </div> 
      </div> 
     </div> 
    </div> (end)

I want to extract all the product links, such as "http://www.alibaba.com/product-detail/Custom-3D-Made-Printed-Blank-Hoodies_60081368914.html?s=p".

I wrote:

Document doc = Jsoup.connect(catUrl).get();
Elements products = doc.select("div.f-icon m-item").select("h2.title").select("a[href]");
for(Element prodUrl: products){
    System.out.println(prodUrl.html());
    itemUrls.addItem(prodUrl.html());
}

Basically, I want to put all the product page URLs into a HashSet called itemUrls, but products seems to be empty. Jsoup.connect(catUrl).get() works fine and returns the web page, but the select method doesn't seem to work. Any input will be greatly appreciated. Thanks.

In a CSS selector, a space describes an ancestor-descendant relationship, so div.f-icon m-item means "find a div with the class f-icon, then look inside it for an element whose tag name is m-item".

In other words, doc.select("div.f-icon m-item") is the same as doc.select("div.f-icon").select("m-item"), which can only match something like

<div class="f-icon">
   ...
     <m-item>...</m-item>
   ...
</div>

which is not what you want.

If you want to select an element that has two classes, use the element.class1.class2 syntax, with no space between the classes.

So instead of

doc.select("div.f-icon m-item").select("h2.title").select("a[href]")

you can write it as

doc.select("div.f-icon.m-item h2.title a[href]")
//          ^^^^^^^^^^^^^^^^^ div with two classes "f-icon" and "m-item"
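To see the difference between the two selectors without hitting the network, here is a minimal sketch that parses an inline snippet with Jsoup.parse. The class name SelectorSpaceDemo, the helper count, and the example.com URL are made up for illustration; the markup is a trimmed-down stand-in for the block in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorSpaceDemo {
    // Hypothetical, trimmed-down markup modelled on the block in the question.
    static final String HTML =
        "<div class=\"f-icon m-item \">"
      + "<h2 class=\"title\"><a href=\"http://example.com/p1.html\">Product 1</a></h2>"
      + "</div>";

    // Returns how many elements the given CSS query matches in HTML.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Space = descendant: looks for an <m-item> tag inside div.f-icon.
        System.out.println(count("div.f-icon m-item"));                  // 0
        // No space = same element must carry both classes.
        System.out.println(count("div.f-icon.m-item"));                  // 1
        System.out.println(count("div.f-icon.m-item h2.title a[href]")); // 1
    }
}
```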

The next problem is that prodUrl.html() returns the link's inner HTML, that is, the content between the opening and closing tags, such as foo in <a href="google.com">foo</a>.

What you seem to want is the value of the href attribute. To get it, use prodUrl.attr("href").
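As a rough illustration of the difference, the sketch below parses a single hypothetical link and reads it three ways. The class name LinkPartsDemo, the helper firstLink, and the example.com URL are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class LinkPartsDemo {
    // Hypothetical one-link document used only for this demonstration.
    static final String HTML = "<a href=\"http://example.com/p1.html\">foo</a>";

    // Parses HTML and returns the first <a> element that has an href attribute.
    static Element firstLink() {
        return Jsoup.parse(HTML).select("a[href]").first();
    }

    public static void main(String[] args) {
        Element link = firstLink();
        System.out.println(link.html());       // inner HTML of the link: foo
        System.out.println(link.text());       // visible text only: foo
        System.out.println(link.attr("href")); // http://example.com/p1.html
    }
}
```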

So your code could look more or less like this:

Document doc = Jsoup.connect(catUrl).get();
Elements products = doc.select("div.f-icon.m-item h2.title a[href]");
for(Element prodUrl: products){
    System.out.println(prodUrl.attr("href"));
    itemUrls.addItem(prodUrl.attr("href"));
}
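For completeness, here is a self-contained sketch of the same loop that collects the URLs into a java.util.Set. It assumes itemUrls is a standard Set<String>, whose insertion method is add; the addItem call above is presumably a method of the asker's own class. The class name ProductLinkCollector, the helpers item and collectProductUrls, and the example.com URLs are hypothetical, and the inline markup stands in for the page fetched with Jsoup.connect(catUrl).get().

```java
import java.util.LinkedHashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ProductLinkCollector {
    // Builds one hypothetical product block shaped like the one in the question.
    static String item(String href) {
        return "<div class=\"f-icon m-item \"><h2 class=\"title\">"
             + "<a href=\"" + href + "\">Some product</a></h2></div>";
    }

    // Two blocks share a URL; the last div lacks the m-item class on purpose.
    static final String SAMPLE =
        item("http://example.com/p1.html")
      + item("http://example.com/p1.html")
      + item("http://example.com/p2.html")
      + "<div class=\"f-icon\"><h2 class=\"title\">"
      + "<a href=\"http://example.com/other.html\">Not a product</a></h2></div>";

    // Collects the href of every product title link; the Set drops duplicates.
    static Set<String> collectProductUrls(Document doc) {
        Set<String> itemUrls = new LinkedHashSet<>();
        for (Element link : doc.select("div.f-icon.m-item h2.title a[href]")) {
            itemUrls.add(link.attr("href"));
        }
        return itemUrls;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE);
        for (String url : collectProductUrls(doc)) {
            System.out.println(url);
        }
    }
}
```

A LinkedHashSet is used here only so the URLs print in page order; a plain HashSet would deduplicate just as well.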

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact yoyou2525@163.com.
