
Jsoup select is not fetching all elements

Please see the image below. The elements before the red arrow are loaded, but the 4 elements after it are not, for some reason.

[image]

This is how I'm selecting these elements:

doc = Jsoup.connect(url)
        .header("Accept-Encoding", "gzip, deflate")
        .userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
        .maxBodySize(0)
        .timeout(600000)
        .get();

Elements detailsBuyBoxContainer = doc.select("li[class^=product-tile]");
System.out.println(detailsBuyBoxContainer.size());

I also tried the select below:

/*Elements detailsBuyBoxContainer = doc.getElementsByAttributeValueContaining("class",
"details-buy-box-container");*/

The size printed should be 24, not 20.

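The list is partly filled by client-side JavaScript, i.e. Ajax calls. Jsoup is not a browser and does not run JavaScript, so it only sees the elements present in the initial HTML response. That is why the straightforward approach you tried returns 20 elements instead of 24.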

I see two solutions:

A) Use Selenium WebDriver, which drives a real browser and will load the Ajax content fine.

B) Identify the Ajax calls yourself and use Jsoup to call the API URL directly. Interpreting the response is often not that hard, although you may have to use different scraping techniques, like parsing JSON instead of HTML.

Addendum

I looked into the Tesco site a bit more, and it seems they use a somewhat unusual approach of sending JSON responses that contain HTML. I guess that saves some JavaScript work on the client, but it is still a bit strange. Here is the call I captured using the browser's network tab. When you scroll down the list, an Ajax call is made to http://www.tesco.com/direct/blocks/catalog/productlisting/infiniteBrowse.jsp?&view=grid&catId=4294967294+4294814304&sortBy=&searchquery=espresso+machine&offset=20&lazyload=true

It seems that the offset parameter is the one you need to increase to get more results back. The response to such a call is a JSON object containing two properties: "products" and "variants". The "products" property seems to contain the HTML.
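To make the role of the offset parameter concrete, here is a minimal sketch that builds the Ajax URL for successive offsets using only the standard library. The class and method names are made up for illustration, and the step size of 20 is an assumption based on the first page showing 20 items; the real site may page differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jsoup or the Tesco site.
class InfiniteBrowseUrls {
    // The captured Ajax URL, with the offset value replaced by a %d placeholder.
    static final String TEMPLATE =
        "http://www.tesco.com/direct/blocks/catalog/productlisting/infiniteBrowse.jsp"
        + "?&view=grid&catId=4294967294+4294814304&sortBy=&searchquery=espresso+machine"
        + "&offset=%d&lazyload=true";

    // URL for one Ajax page starting at the given offset.
    static String pageUrl(int offset) {
        return String.format(TEMPLATE, offset);
    }

    // URLs for every Ajax page needed to reach `total` items.
    // Assumption: each response returns up to `pageSize` items.
    static List<String> pageUrls(int firstOffset, int pageSize, int total) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<String> urls = new ArrayList<>();
        for (int offset = firstOffset; offset < total; offset += pageSize) {
            urls.add(pageUrl(offset));
        }
        return urls;
    }

    public static void main(String[] args) {
        // The first 20 items come with the initial page; 24 are expected in total.
        for (String url : pageUrls(20, 20, 24)) {
            System.out.println(url);
        }
    }
}
```

Each URL produced this way can then be fetched and parsed as described in the steps that follow. If the total is not known in advance, an alternative is to keep increasing the offset until a response contains no more product tiles.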

So step by step:

1) Use Jsoup (or, for example, Apache HttpClient) to get the raw contents of the Ajax call:

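import org.jsoup.Connection;
import org.jsoup.Jsoup;

// ignoreContentType(true) lets Jsoup fetch the JSON response instead of rejecting it as non-HTML
Connection con = Jsoup.connect("http://www.tesco.com/direct/blocks/catalog/productlisting/infiniteBrowse.jsp?&view=grid&catId=4294967294+4294814304&sortBy=&searchquery=espresso+machine&offset=20&lazyload=true")
            .ignoreContentType(true);
Connection.Response res = con.execute();
String rawJSON = res.body();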

2) Parse the JSON with a library of your choice. I usually use JSON-Simple:

// Parse the raw JSON from step 1; the "products" property holds the HTML markup
JSONObject o = (JSONObject) JSONValue.parse(rawJSON);
String html = (String) o.get("products");

Note that JSON-Simple is easy to use but does not use generics. You may want to look into Gson or Jackson as well.

3) Parse the HTML with Jsoup:

Document doc = Jsoup.parse(html);

Inspect the code of the website and find the Ajax function that generates the elements you are missing, then request the URL that function calls from your Jsoup.connect call.

This could help you!
