
Web Scraping with Java using HTMLUnit

I am trying to web scrape https://www.nba.com/standings#/

Here is my code

What I am trying to use is page.getByXPath("//caption[@class='standings__header']/span")

This should pull back "Eastern Conference" and "Western Conference", but instead it returns nothing. Is my XPath wrong?

    package Standings;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlElement;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlSpan;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class Standings {
        private static final String baseUrl = "https://www.nba.com/standings#/";

        public static void main(String[] args) {
            WebClient client = new WebClient();
            client.getOptions().setJavaScriptEnabled(false);
            client.getOptions().setCssEnabled(false);
            client.getOptions().setUseInsecureSSL(true);
            String jsonString = "";
            ObjectMapper mapper = new ObjectMapper();

            try {
                HtmlPage page = client.getPage(baseUrl);
                System.out.println(page.asXml());

                List<HtmlSpan> spans =
                        page.getByXPath("//caption[@class='standings__header']/span");
                spans.forEach(span -> System.out.println(span.getTextContent()));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

I have used this code to verify your problem:

    public static void main(String[] args) throws IOException {
        final String url = "https://www.nba.com/standings#/";

        try (final WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setUseInsecureSSL(true);

            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(10000);

            System.out.println(page.asXml());
        }
    }

When running this I got a bunch of warnings and errors in the log.

(BTW: the page also produces many errors/warnings when loaded in real browsers. It seems the maintainer of the page has an interesting view on quality.)

I guess the problematic error is this one:

TypeError: Cannot modify readonly property: constructor. ( https://www.nba.com/ng/game/main.js#1 )

There is a known bug in the JavaScript support of HtmlUnit ( https://sourceforge.net/p/htmlunit/bugs/1897/ ). Because the error is thrown from main.js, I guess it stops the processing of the page's JavaScript before the content you are looking for is generated.

So far I have found no time to fix this (it looks like it has to be fixed in Rhino), but it is on the list.

Have a look at https://twitter.com/HtmlUnit to get informed about updates.

The page you are trying to scrape needs JavaScript to render properly. If you disable it, most of the elements won't load. Changing the line

client.getOptions().setJavaScriptEnabled(false);

to

client.getOptions().setJavaScriptEnabled(true);

should do the trick.
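Putting both answers together, a minimal sketch looks like the following: JavaScript enabled, script errors tolerated (the page is noisy), and a wait for background scripts before running the XPath query. This assumes the markup still uses the `standings__header` class from the question; note that the Rhino bug mentioned above may still prevent the content from being generated, in which case the list comes back empty.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSpan;

import java.io.IOException;
import java.util.List;

public class StandingsScraper {
    public static void main(String[] args) throws IOException {
        try (WebClient client = new WebClient()) {
            // The standings table is built by JavaScript, so it must stay enabled.
            client.getOptions().setJavaScriptEnabled(true);
            // The page throws many script errors; don't let them abort the load.
            client.getOptions().setThrowExceptionOnScriptError(false);
            client.getOptions().setCssEnabled(false);
            client.getOptions().setUseInsecureSSL(true);

            HtmlPage page = client.getPage("https://www.nba.com/standings#/");
            // Give the asynchronous scripts up to 10 seconds to finish.
            client.waitForBackgroundJavaScript(10_000);

            // Class name taken from the question; may be empty if the JS bug hits.
            List<HtmlSpan> headers =
                    page.getByXPath("//caption[@class='standings__header']/span");
            for (HtmlSpan span : headers) {
                System.out.println(span.getTextContent());
            }
        }
    }
}
```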
