I have an HTML file that contains many blocks like the following:
<div class="f-icon m-item " data-ctrdot="60055294621">
<div class="item-main util-clearfix">
<div class="content">
<div class="cwrap">
<div class="cleft">
<div class="lwrap">
<h2 class="title"><a href="http://www.alibaba.com/product-detail/Sunnytex-Best-Selling-wind-proof-Soft_60055294621.html?s=p" title="Sunnytex Best Selling wind proof Soft Shell Winter Black Wool Coat" data-hislog="60055294621" data-pid="60055294621" data-domdot="id:2678,pid:60055294621,ext:'|n=2|s=p|t={{attr target}}'" target="_blank" data-p4plog="60055294621">Sunnytex Best Selling wind proof Soft Shell Winter Black Wool Coat</a> </h2>
<div class="attr">
US $23.5-24.8 /
<em>Piece</em>
<em>( FOB Price)</em>
</div>
<div class="attr">
500 Pieces
<em>(Min. Order)</em>
</div>
<div class="kv-prop util-clearfix">
<div class="kv" title="Product Type: Coats">
Product Type:
<b>Coats</b>
</div>
<div class="kv" title="Age Group: Adults">
Age Group:
<b>Adults</b>
</div>
.... (many other stuff not shown here)
</div>
</div>
</div>
</div> (end)
I want to extract all the links, such as "http://www.alibaba.com/product-detail/Custom-3D-Made-Printed-Blank-Hoodies_60081368914.html?s=p".
I wrote:
Document doc = Jsoup.connect(catUrl).get();
Elements products = doc.select("div.f-icon m-item").select("h2.title").select("a[href]");
for(Element prodUrl: products){
System.out.println(prodUrl.html());
itemUrls.addItem(prodUrl.html());
}
So basically I want to put all the product page URLs into a HashSet called itemUrls, but it seems that there is nothing in products. Jsoup.connect(catUrl).get() works fine and returns the web page, but the select method doesn't seem to work. Any input will be greatly appreciated. Thanks.
A space in a selector describes an ancestor-descendant relationship, so div.f-icon m-item matches a div with the class f-icon and then looks for an m-item element somewhere inside it.
In other words, doc.select("div.f-icon m-item") is the same as doc.select("div.f-icon").select("m-item"), which can only find something like
<div class="f-icon">
...
<m-item>...</m-item>
...
</div>
which is not what you want.
If you want to select an element that has two classes, use the element.class1.class2 syntax (no space between the classes).
So instead of
doc.select("div.f-icon m-item").select("h2.title").select("a[href]")
you can write it as
doc.select("div.f-icon.m-item h2.title a[href]")
//          ^^^^^^^^^^^^^^^^^ div with two classes "f-icon" and "m-item"
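To see the difference between the two selectors, you can run them against a small inline document. This is only a sketch: it assumes jsoup is on the classpath, and the class name SelectorDemo, the helper count and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // Hypothetical sample markup, trimmed down from the block in the question.
    static final String HTML =
        "<div class=\"f-icon m-item \">"
      + "<h2 class=\"title\"><a href=\"http://example.com/a.html\">A</a></h2>"
      + "</div>";

    // Returns how many elements the given CSS query matches in the sample markup.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Descendant selector: looks for an <m-item> tag inside div.f-icon, finds none.
        System.out.println(count("div.f-icon m-item"));                  // prints 0
        // Compound selector: one div carrying both classes.
        System.out.println(count("div.f-icon.m-item"));                  // prints 1
        // Full path down to the link inside the title.
        System.out.println(count("div.f-icon.m-item h2.title a[href]")); // prints 1
    }
}
```

The trailing space in the class attribute does not matter, since class names are split on whitespace before matching.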
The next thing is that prodUrl.html() returns the inner HTML of the link, that is, the text displayed for it, like foo in <a href="google.com">foo</a>.
What you seem to want is the value of the href attribute. To get it, use prodUrl.attr("href").
So your code can look more or less like
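Document doc = Jsoup.connect(catUrl).get();
// div with both classes -> title heading -> link that has an href
Elements products = doc.select("div.f-icon.m-item h2.title a[href]");
for (Element prodUrl : products) {
    // attr("href") gives the link target rather than the link text
    System.out.println(prodUrl.attr("href"));
    // addItem is kept from the question; a java.util.HashSet would use add(...)
    itemUrls.addItem(prodUrl.attr("href"));
}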
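For completeness, here is a self-contained sketch of the same idea that parses an HTML string instead of fetching catUrl, so it can be tried offline. It assumes jsoup is on the classpath and that itemUrls is a plain java.util.Set; the class name ProductLinks, the method extract and the example.com URLs are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ProductLinks {
    // Collects the href of every title link inside a div.f-icon.m-item block.
    static Set<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        // LinkedHashSet drops duplicates and keeps the order of first appearance.
        Set<String> itemUrls = new LinkedHashSet<>();
        for (Element link : doc.select("div.f-icon.m-item h2.title a[href]")) {
            itemUrls.add(link.attr("href"));
        }
        return itemUrls;
    }

    public static void main(String[] args) {
        // Made-up markup: two product blocks, the second repeated once.
        String item = "<div class=\"f-icon m-item \"><h2 class=\"title\">"
                    + "<a href=\"%s\">product</a></h2></div>";
        String html = String.format(item, "http://example.com/p1.html")
                    + String.format(item, "http://example.com/p2.html")
                    + String.format(item, "http://example.com/p2.html");
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

If the page used relative links, link.absUrl("href") could be used in place of link.attr("href"), provided the document was parsed or fetched with a base URI.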