
How to retrieve data from a data table on Sports Reference using JSoup?

I'm attempting to use JSoup to retrieve the number of wins for a team from a Sports Reference table.

Specifically, I am trying to retrieve the data point highlighted below, with the HTML code provided.

Below is what I have tried already, but I get a null pointer exception when trying to access the text of this element, which suggests that my code is not parsing the HTML correctly.

Element wins = document.selectFirst("td[data-stat=\"wins\"]");

What I want is for the text of this element to be 34 (or whatever the number of wins is for the team).

Check what your Document was able to read from the page by printing it. If the element you need belongs to HTML content that the browser adds dynamically through JavaScript, you need to use Selenium rather than Jsoup.

To read the HTML source, you can write something similar to:

import java.io.IOException;
import org.jsoup.Jsoup;

public class JSoupHTMLSourceEx {
    public static void main(String[] args) throws IOException {
        String webPage = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
        String html = Jsoup.connect(webPage).get().html();
        System.out.println(html);
    }
}
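
Once you have the page source as a string, a plain text check can tell you whether the attribute your selector targets is present in the static HTML at all. The sketch below is only an illustration: the class StaticHtmlCheck, the method containsStat, and the sample markup are hypothetical and not part of Jsoup or the original answer. It uses only the Java standard library, so in practice you would pass it the string returned by Jsoup.connect(...).get().html().

```java
// Hypothetical helper (not part of Jsoup): checks whether the raw HTML
// contains a data-stat attribute with the given value.
final class StaticHtmlCheck {

    static boolean containsStat(String html, String stat) {
        // Look for the literal attribute text, e.g. data-stat="wins"
        return html != null && html.contains("data-stat=\"" + stat + "\"");
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for a downloaded page
        String withCell = "<table id=\"team_misc\"><tbody><tr>"
                + "<td data-stat=\"wins\">34</td></tr></tbody></table>";
        String withoutCell = "<div id=\"all_team_misc\"></div>";

        System.out.println(containsStat(withCell, "wins"));    // true
        System.out.println(containsStat(withoutCell, "wins")); // false
    }
}
```

A true result does not guarantee that a selector will match (the markup could, for instance, sit inside an HTML comment), but a false result suggests the cell is not in the static HTML that Jsoup receives.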

Since Jsoup supports CSS selectors, you can try to get the element like this:

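import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupSelectorEx {
    public static void main(String[] args) throws IOException {
        String webPage = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
        // Fetch the page and parse it into a Document in one step
        Document document = Jsoup.connect(webPage).get();
        // First row of the team_misc table, second child cell
        Elements tds = document.select("#team_misc > tbody > tr:nth-child(1) > td:nth-child(2)");
        for (Element e : tds) {
            System.out.println(e.text());
        }
    }
}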

But a better solution is to use Selenium, a portable framework for testing web applications:

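import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SeleniumTeamMiscEx {
    public static void main(String[] args) {
        String baseUrl = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
        WebDriver driver = new FirefoxDriver();

        driver.get(baseUrl);
        // First data cell in the first row of the team_misc table
        String innerText = driver.findElement(
            By.xpath("//*[@id=\"team_misc\"]/tbody/tr[1]/td[1]")).getText();
        System.out.println(innerText);
        driver.quit();
    }
}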

Alternatively, instead of:

driver.findElement(By.xpath("//*[@id=\"team_misc\"]/tbody/tr[1]/td[1]")).getText();

you can try this form:

driver.findElement(By.xpath("//*[@id=\"team_misc\"]/tbody/tr[1]/td[1]")).getAttribute("innerHTML");

P.S. In the future, it would be useful to include a link to the source page you want to get information from, or at least a snippet of the DOM structure, instead of an image.
