JSoup get all elements based on class

I am writing a web scraper using JSoup to take prices from the first page of search results on Amazon. For example, if you search for "hammer" on Amazon, the first page of search results comes up, and my scraper should take the price of each search result and show them. However, I can't figure out why nothing is printed when I run my program. The HTML for the price of an item on Amazon.ca is:

<a class="a-link-normal a-text-normal" href="http://www.amazon.ca/Stanley-51-624-Fiberglass-Hammer-20-Ounce/dp/B000VSSG2K/ref=sr_1_1?ie=UTF8&amp;qid=1436274467&amp;sr=8-1&amp;keywords=hammer"><span class="a-size-base a-color-price s-price a-text-bold">CDN$ 17.52</span></a>

I run my code as follows:

Elements prices = doc.getElementsByClass("a-size-base a-color-price s-price a-text-bold");
System.out.println("Prices: " + prices);

What is returned:

Prices: 

How do I get the price value "CDN$ 17.52" in this case?

One way would be doc.select("span.s-price"); another would be doc.getElementsByClass("s-price").
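A minimal sketch of both calls, run against the markup from the question rather than a live Amazon page. The class name SingleClassLookup, the helper method names, and the shortened href are made up for illustration, and it assumes a jsoup version that provides Elements.eachText().

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SingleClassLookup {
    // Markup taken from the question, with the href shortened for readability.
    static final String HTML =
        "<a class=\"a-link-normal a-text-normal\" href=\"#\">"
        + "<span class=\"a-size-base a-color-price s-price a-text-bold\">CDN$ 17.52</span></a>";

    // CSS selector: any <span> element that carries the s-price class.
    static List<String> pricesBySelector(Document doc) {
        return doc.select("span.s-price").eachText();
    }

    // Lookup by one class name, regardless of the tag.
    static List<String> pricesByClass(Document doc) {
        return doc.getElementsByClass("s-price").eachText();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println("Prices: " + pricesBySelector(doc));
        System.out.println("Prices: " + pricesByClass(doc));
    }
}
```

Using eachText() (or text() on a single element) prints just the price text instead of the whole element markup that "Prices: " + prices would give.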

Your code doesn't work because getElementsByClass expects a single class name and returns all elements that have that class. You've supplied several class names in one string; the function can't cope with that and finds nothing.

The span element you're looking for has several classes applied to it: a-size-base, a-color-price, s-price and a-text-bold. You can look for any one of these classes, and it's also possible to match elements which have all four classes by building a CSS selector like doc.select(".a-size-base.a-color-price.s-price.a-text-bold").
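As a rough illustration of the difference, the sketch below compares the four-class selector with the single-class one. The first span is the one from the question; the second span (with only s-price) is a hypothetical element added purely to show what the stricter selector leaves out. The class and method names are also invented for the example.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AllClassesLookup {
    // First span: from the question. Second span: hypothetical, for contrast only.
    static final String HTML =
        "<span class=\"a-size-base a-color-price s-price a-text-bold\">CDN$ 17.52</span>"
        + "<span class=\"s-price\">CDN$ 9.99</span>";

    // Chained class selectors: an element must carry all four classes to match.
    static List<String> withAllFourClasses(Document doc) {
        return doc.select(".a-size-base.a-color-price.s-price.a-text-bold").eachText();
    }

    // A single class selector matches every element that has s-price.
    static List<String> withSPrice(Document doc) {
        return doc.select(".s-price").eachText();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println("All four classes: " + withAllFourClasses(doc));
        System.out.println("s-price only:     " + withSPrice(doc));
    }
}
```

Note that there are no spaces between the class names in the chained selector; a space in a CSS selector would mean "descendant of" instead.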

However, you probably want as simple a selector as possible, because Amazon are free to change their CSS styles at any time, and any such change can easily break your scraper.

The simpler the scraper is, the more resilient it is to breakage. You might want to look for prices through semantics rather than rendered style, e.g. doc.getElementsContainingOwnText("CDN$") would select elements containing the literal text "CDN$".
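Here is a small sketch of the text-based approach. The markup is hypothetical: it imagines that the price span's class has been renamed (some-renamed-class is invented), so a class-based selector would no longer match, while the text lookup still would. The class and method names are also just for illustration.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TextBasedLookup {
    // Hypothetical markup: the price span no longer has the s-price class.
    static final String HTML =
        "<div class=\"result\"><h2>Stanley hammer</h2>"
        + "<span class=\"some-renamed-class\">CDN$ 17.52</span></div>";

    // Matches elements whose own text (not their children's) contains "CDN$".
    static List<String> pricesByText(Document doc) {
        return doc.getElementsContainingOwnText("CDN$").eachText();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println("Prices: " + pricesByText(doc));
    }
}
```

Because only an element's own text is checked, the enclosing div is not returned even though it contains the price further down. The trade-off is that any other element whose own text mentions "CDN$" would be picked up as well.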
