
JSoup get all elements based on class

I am writing a web scraper using JSoup to take prices from the first page of search results on Amazon. For example, if you search for "hammer" on Amazon, the first page of search results comes up, and my scraper collects the price of each result and shows them. However, I can't figure out why nothing is printed when I run my program. The HTML for the price of an item on Amazon.ca is:

<a class="a-link-normal a-text-normal" href="http://www.amazon.ca/Stanley-51-624-Fiberglass-Hammer-20-Ounce/dp/B000VSSG2K/ref=sr_1_1?ie=UTF8&amp;qid=1436274467&amp;sr=8-1&amp;keywords=hammer"><span class="a-size-base a-color-price s-price a-text-bold">CDN$ 17.52</span></a>

I run my code as follows:

Elements prices = doc.getElementsByClass("a-size-base a-color-price s-price a-text-bold");
System.out.println("Prices: " + prices);

What is printed:

Prices: 

How do I get the price value "CDN$ 17.52" in this case?

One way would be doc.select("span.s-price"); another would be doc.getElementsByClass("s-price").
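Both lookups can be tried against the markup from the question. The following is a minimal sketch, assuming jsoup is on the classpath; the class name PriceLookup, its helper methods, and the shortened href are illustrative and not part of the original post. It calls text() on each match, because printing an Elements object directly outputs the matched elements' HTML rather than just the price.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class PriceLookup {
    // Markup modeled on the question's snippet; the href is shortened for readability.
    static final String HTML =
        "<a class=\"a-link-normal a-text-normal\" href=\"#\">"
        + "<span class=\"a-size-base a-color-price s-price a-text-bold\">CDN$ 17.52</span></a>";

    // Collects the visible text of every matched element.
    static List<String> texts(Elements elements) {
        List<String> result = new ArrayList<>();
        for (Element element : elements) {
            result.add(element.text());
        }
        return result;
    }

    // Lookup via a CSS selector: span elements carrying the class s-price.
    static List<String> bySelector(Document doc) {
        return texts(doc.select("span.s-price"));
    }

    // Lookup via a single class name: any element carrying the class s-price.
    static List<String> bySingleClass(Document doc) {
        return texts(doc.getElementsByClass("s-price"));
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println("Prices (selector): " + bySelector(doc));
        System.out.println("Prices (class):    " + bySingleClass(doc));
    }
}
```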

Your code doesn't work because getElementsByClass expects a single class name and returns all elements which have that class. You've supplied several class names at once; the function can't cope with that and finds nothing.
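The failure is easier to see with a simplified model of class matching: an element's class attribute is a whitespace-separated list of tokens, and a class lookup asks whether one of those tokens equals the requested name. The sketch below uses only the standard library. ClassTokenDemo and hasClass are hypothetical names, and this is not jsoup's actual implementation, whose details may differ between versions.

```java
import java.util.Arrays;

public class ClassTokenDemo {
    // Simplified model: true if `name` equals one whitespace-separated
    // token of the class attribute.
    static boolean hasClass(String classAttribute, String name) {
        return Arrays.stream(classAttribute.trim().split("\\s+"))
                     .anyMatch(token -> token.equals(name));
    }

    public static void main(String[] args) {
        String classAttribute = "a-size-base a-color-price s-price a-text-bold";

        // A single class name matches one of the tokens.
        System.out.println(hasClass(classAttribute, "s-price"));

        // A name containing spaces can never equal a single token.
        System.out.println(hasClass(classAttribute, classAttribute));
    }
}
```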

The span element you're looking for has several classes applied to it: a-size-base, a-color-price, s-price and a-text-bold. You can look for any one of these classes, and it's also possible to match elements which have all four classes by building a CSS selector like doc.select(".a-size-base.a-color-price.s-price.a-text-bold").
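To illustrate the all-classes match, here is a small sketch, assuming jsoup is available. The second span and the names AllClassesSelector and textsMatching are made up for the example. A chained selector such as .a.b matches only elements carrying every listed class, regardless of the order in which the classes appear in the selector or in the attribute.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AllClassesSelector {
    // First span: the question's price element. Second span: a made-up
    // element that carries only two of the four classes.
    static final String HTML =
        "<span class=\"a-size-base a-color-price s-price a-text-bold\">CDN$ 17.52</span>"
        + "<span class=\"a-size-base a-color-price\">other text</span>";

    // Returns the text of every element matched by the CSS query.
    static List<String> textsMatching(Document doc, String cssQuery) {
        List<String> texts = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            texts.add(element.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // All four classes required: only the price span qualifies.
        System.out.println(textsMatching(doc, ".a-size-base.a-color-price.s-price.a-text-bold"));
        // Same classes in a different order: same result.
        System.out.println(textsMatching(doc, ".s-price.a-text-bold.a-size-base.a-color-price"));
        // Only two classes required: both spans qualify.
        System.out.println(textsMatching(doc, ".a-size-base.a-color-price"));
    }
}
```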

However, you probably want as simple a selector as possible, because Amazon is free to change its CSS classes at any time, and any such change can easily break your scraper.

The simpler the scraper is, the more resilient it is to breakage. You might want to look for prices through semantics rather than rendered style; for example, doc.getElementsContainingOwnText("CDN$") would select elements whose own text contains the literal "CDN$".
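A text-based lookup along those lines might look like the following sketch, again assuming jsoup on the classpath. The names PriceByText and findPrices, the sample markup, and the regular expression are assumptions for illustration; the pattern only handles amounts written like "CDN$ 17.52" and would need adjusting for thousands separators or other currencies.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PriceByText {
    // Matches "CDN$" followed by an amount such as 17.52 (assumed format).
    private static final Pattern PRICE =
        Pattern.compile("CDN\\$\\s*(\\d+(?:\\.\\d{1,2})?)");

    // Sample markup; the class names are deliberately irrelevant here.
    static final String HTML =
        "<a href=\"#\"><span class=\"anything\">CDN$ 17.52</span></a>"
        + "<p>Free shipping</p>";

    static List<BigDecimal> findPrices(Document doc) {
        List<BigDecimal> prices = new ArrayList<>();
        // Only elements whose own text (not their children's) contains "CDN$".
        for (Element element : doc.getElementsContainingOwnText("CDN$")) {
            Matcher matcher = PRICE.matcher(element.ownText());
            if (matcher.find()) {
                prices.add(new BigDecimal(matcher.group(1)));
            }
        }
        return prices;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println("Prices: " + findPrices(doc));
    }
}
```

Parsing into BigDecimal rather than double avoids rounding surprises if the prices are later summed or compared.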


 