
Why is the HTML code different when parsing a site using Jsoup than when using a browser?

I am on the website http://www.flashscore.com/nhl/ and I am trying to extract the links of the 'Today's Matches' table.

I am trying it with the following code, but it does not work. Can you point out where the mistake is?

  final Document page = Jsoup
    .connect("http://d.flashscore.com/x/feed/t_4_200_G2Op923t_1_en_1")
    .cookie("_ga","GA1.2.47011772.1485726144")
    .referrer("http://d.flashscore.com/x/feed/proxy-local")
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")
    .header("X-Fsign", "SW9D1eZo")
    .header("X-GeoIP", "1")
    .header("X-Requested-With", "XMLHttpRequest")
    .header("Accept" , "*/*")
    .get();

  for (Element game : page.select("table.hockey tr")) {
      Elements links = game.getElementsByClass("tr-first stage-finished");
      for (Element link : links) {
          String linkHref = link.attr("href");
          String linkText = link.text();
      }
  }

To try to fix it I started to debug it. It shows that we do get the page (although we are getting a rather strange HTML). After that, the debugging showed that the for loop does not even start. I tried changing the page.select("") part to different methods (like getElementByAttribute etc.), but I have just started to learn web scraping, so I still need to get familiar with those methods for navigating through a document. How am I supposed to extract this data?

As said in the comments, this website needs to execute some JavaScript in order to build those linkable elements. Jsoup only parses HTML; it does not run any JS, so you will not see the same HTML source from Jsoup as you would from a browser.

You need to get the website as if you were running it in a real browser. You can do that programmatically using WebDriver and Firefox.

I've tried it with your example site and it works:

pom.xml

<project>

<modelVersion>4.0.0</modelVersion>
<groupId>com.test</groupId>
<artifactId>test</artifactId>
<version>1.0-SNAPSHOT</version>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>
<packaging>jar</packaging>

<name>test</name>
<url>http://maven.apache.org</url>

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

<dependencies>
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-firefox-driver</artifactId>
    <version>2.43.0</version>
  </dependency>
</dependencies>

</project>

App.java

package com.test;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class App {

    public static void main(String[] args) {
        App app = new App();
        List<String> links = app.parseLinks();
        links.forEach(System.out::println);
    }

    public List<String> parseLinks() {
        try {
            // Download geckodriver from https://github.com/mozilla/geckodriver/releases
            // and set the path according to your local file
            System.setProperty("webdriver.firefox.marionette", "C:\\apps\\geckodriver.exe");
            WebDriver driver = new FirefoxDriver();
            String baseUrl = "http://www.flashscore.com/nhl/";

            driver.get(baseUrl);

            return driver.findElement(By.className("hockey"))
                    .findElements(By.tagName("tr"))
                    .stream()
                    .distinct()
                    .filter(we -> !we.getAttribute("id").isEmpty())
                    .map(we -> createLink(we.getAttribute("id")))
                    .collect(Collectors.toList());

        } catch (Exception e) {
            e.printStackTrace();
            return Collections.emptyList();
        }
    }

    private String createLink(String id) {
        return String.format("http://www.flashscore.com/match/%s/#match-summary", extractId(id));
    }

    private String extractId(String id) {
        if (id.contains("x_4_")) {
            id = id.replace("x_4_", "");
        } else if (id.contains("g_4_")) {
            id = id.replace("g_4_", "");
        }
        return id;
    }
}

Output:

http://www.flashscore.com/match/f9MJJI69/#match-summary
http://www.flashscore.com/match/zZCyd0dC/#match-summary
http://www.flashscore.com/match/drEXdts6/#match-summary
http://www.flashscore.com/match/EJOScMRa/#match-summary
http://www.flashscore.com/match/0GKOb2Cg/#match-summary
http://www.flashscore.com/match/6gLKarcm/#match-summary
...
...

PS: This works with Firefox version 32.0 and Selenium 2.43.0. Using incompatible versions of Selenium and Firefox is a common error.
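The link-building half of the answer above needs no browser at all, so it can be checked on its own. A minimal standalone sketch (the ids below are made up for illustration; the prefix handling mirrors extractId() from App.java, using startsWith instead of contains so only a leading prefix is stripped):

```java
// Hypothetical standalone sketch of the id-to-URL step used in App.java above.
public class LinkBuilder {

    // Strip the "x_4_" / "g_4_" row-id prefix that flashscore puts on each tr id.
    static String extractId(String id) {
        if (id.startsWith("x_4_") || id.startsWith("g_4_")) {
            return id.substring(4);
        }
        return id;
    }

    // Build the match-summary URL from a raw row id.
    static String createLink(String id) {
        return String.format("http://www.flashscore.com/match/%s/#match-summary", extractId(id));
    }

    public static void main(String[] args) {
        // Made-up ids of the shape the driver reads from the tr elements.
        String[] ids = {"g_4_f9MJJI69", "x_4_zZCyd0dC"};
        for (String id : ids) {
            System.out.println(createLink(id));
            // e.g. http://www.flashscore.com/match/f9MJJI69/#match-summary
        }
    }
}
```

Using startsWith is a small safety improvement over the contains/replace pair in the original: it cannot accidentally remove a matching substring from the middle of an id.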

You are using the wrong address in .connect("http://d.flashscore.com/x/feed/t_4_200_G2Op923t_1_en_1"); you need to use .connect("http://www.flashscore.com/nhl/") there.

Then, note that this site uses JS, so even after you get the right page it will be rendered differently than in a browser; e.g. there won't be a table with class 'hockey'. You'll see this in the page you get back, so you'll need to change your locators. Or consider using WebDriver for this.
