Not able to fetch some class from the jsoup API
Hi, I am new to jsoup and am trying to scrape data from the following link:
https://www.zomato.com/ahmedabad/mcdonalds-navrangpura
but I'm not able to get the data for elements with the following class: rev-text
This is my code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {
    public static void main(String[] args) throws IOException {
        Document doc;
        doc = Jsoup.connect("https://www.zomato.com/ahmedabad/mcdonalds-navrangpura").userAgent("Chrome/41.0.2228.0").get();
        // get page title
        String title = doc.title();
        System.out.println("title : " + title);
        // get all elements with the rev-text class
        Elements links = doc.getElementsByClass("rev-text");
        /* Elements links = doc.getAllElements();*/
        for (Element link : links) {
            // print the element's markup and its text content
            System.out.println("\nlink : " + link);
            System.out.println("text : " + link.text());
        }
    }
}
Please guide me on how to do this.
Problem Background
The rev-text element is not part of the "default" page source; it is loaded dynamically using JavaScript. Since Jsoup is not a browser simulator, it doesn't execute the scripts on the page; it just gives you the source as the server sent it.
A simple way to test the retrieved source is to print it out; you will see that the rev-text class is not present at all.
System.out.println(doc.html()); //print out page source
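Rather than reading the printed source by eye, the same check can be done programmatically. The following is a minimal sketch using only the standard library, so it runs without Jsoup on the class path. The class name StaticSourceCheck, the method hasClass, and both HTML strings are made up for illustration; the regex is a rough check, not a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: checks raw HTML text for an element carrying a given class name. */
final class StaticSourceCheck {

    // Matches class="..." or class='...' attributes. A rough check, not an HTML parser.
    private static final Pattern CLASS_ATTR =
            Pattern.compile("class\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean hasClass(String html, String className) {
        Matcher m = CLASS_ATTR.matcher(html);
        while (m.find()) {
            // A class attribute may list several space-separated names.
            for (String name : m.group(2).trim().split("\\s+")) {
                if (name.equals(className)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up stand-in for a server response whose reviews are filled in later by a script.
        String staticSource = "<html><body><div id=\"reviews\" class=\"res-reviews\"></div>"
                + "<script src=\"/load-reviews.js\"></script></body></html>";
        // Made-up stand-in for the same page after a browser has executed the script.
        String renderedSource = "<html><body><div id=\"reviews\" class=\"res-reviews\">"
                + "<div class=\"rev-text mbot0\">Good fries</div></div></body></html>";

        System.out.println("static:   " + hasClass(staticSource, "rev-text"));
        System.out.println("rendered: " + hasClass(renderedSource, "rev-text"));
    }
}
```

With a Jsoup Document already in hand, checking whether doc.getElementsByClass("rev-text") is empty gives the same answer more reliably.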
Proposed Solution
To scrape content from JavaScript-heavy web pages, it usually helps to use a tool that simulates a browser by executing the scripts on the page. A common library that does this is Selenium. You can use the PhantomJS driver in Selenium (it is worth reading up on) to fetch the page, pass the page source to Jsoup, and extract the rev-text elements.
Here is sample code that uses Selenium to extract the fields you need:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

public class Test {
    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new PhantomJSDriver(new DesiredCapabilities());
        driver.get("https://www.zomato.com/ahmedabad/mcdonalds-navrangpura"); // retrieve page with Selenium
        Thread.sleep(3 * 1000); // bad idea: better to wait for a specific element (e.g. the rev-text class) than to sleep
        Document doc = Jsoup.parse(driver.getPageSource());
        driver.quit(); // quit the webdriver
        // get page title
        String title = doc.title();
        System.out.println("title : " + title);
        // get all elements with the rev-text class
        Elements links = doc.getElementsByClass("rev-text");
        for (Element link : links) {
            // print the element's markup and its text content
            System.out.println("\nlink : " + link);
            System.out.println("text : " + link.text());
        }
    }
}
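The comment on the Thread.sleep call notes that waiting for a specific element is better than a fixed sleep. Selenium ships its own explicit-wait support (WebDriverWait), which is probably the better choice in real code. The sketch below only illustrates the underlying idea, polling a condition until it holds or a timeout passes, using the standard library alone. The class Poller and the method waitUntil are hypothetical names, not part of Selenium or Jsoup.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper illustrating wait-until-condition polling; not part of Selenium or Jsoup. */
final class Poller {

    /**
     * Re-checks the condition every pollMillis until it holds or timeoutMillis has elapsed.
     * Returns true if the condition held in time, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition that becomes true after roughly 300 ms. With a WebDriver it
        // might instead be: () -> driver.getPageSource().contains("rev-text")
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 3000, 50);
        System.out.println("ready: " + ready);
    }
}
```

Compared with a fixed three-second sleep, this returns as soon as the content appears and reports a timeout explicitly instead of silently parsing an incomplete page.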
You will need to add the Selenium libraries to your class path. I'm using Maven, so all I added was:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.45.0</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-remote-driver</artifactId>
    <version>2.45.0</version>
</dependency>
<dependency>
    <groupId>com.codeborne</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.1</version>
</dependency>
This works fine for me and extracts the reviews on the page.