
Web Scraping with Java using HTMLUnit

I am trying to web scrape https://www.nba.com/standings#/

Here is my code

What I am trying to use is page.getByXPath("//caption[@class='standings__header']/span")

This should pull back "Eastern Conference" and "Western Conference", but instead it returns nothing. Is my XPath wrong?

    package Standings;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlElement;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlSpan;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class Standings {
        private static final String baseUrl = "https://www.nba.com/standings#/";

        public static void main(String[] args) {
            WebClient client = new WebClient();
            client.getOptions().setJavaScriptEnabled(false);
            client.getOptions().setCssEnabled(false);
            client.getOptions().setUseInsecureSSL(true);
            String jsonString = "";
            ObjectMapper mapper = new ObjectMapper();

            try {
                HtmlPage page = client.getPage(baseUrl);
                System.out.println(page.asXml());

                List<HtmlSpan> spans =
                        page.getByXPath("//caption[@class='standings__header']/span");
                spans.forEach(span -> System.out.println(span.getTextContent()));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

I have used this code to verify your problem:

    public static void main(String[] args) throws IOException {
        final String url = "https://www.nba.com/standings#/";

        try (final WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setUseInsecureSSL(true);

            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(10000);

            System.out.println(page.asXml());
        }
    }

When running this I got a bunch of warnings and errors in the log.

(BTW: the page also produces many errors/warnings when loaded in real browsers. It seems the maintainer of the page has an interesting view on quality.)

I guess the problematic error is this one:

TypeError: Cannot modify readonly property: constructor. ( https://www.nba.com/ng/game/main.js#1 )

There is a known bug in the JavaScript support of HtmlUnit ( https://sourceforge.net/p/htmlunit/bugs/1897/ ). Because the error is thrown from main.js, I guess it stops the processing of the page's JavaScript before the content you are looking for is generated.

So far I have found no time to fix this (it looks like it has to be fixed in Rhino), but it is on the list.

Have a look at https://twitter.com/HtmlUnit to get informed about updates.

The page you are trying to scrape needs JavaScript to render properly. If you disable it, most of the elements won't load. Changing the line

client.getOptions().setJavaScriptEnabled(false);

to

client.getOptions().setJavaScriptEnabled(true);

should do the trick.
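Putting both answers together, a minimal sketch looks like the following: JavaScript enabled, script errors tolerated (the page is noisy), and a wait for background scripts before running the XPath query. This assumes the markup still uses the `standings__header` class from the question; note that the Rhino bug mentioned above may still prevent the content from being generated, in which case the list comes back empty.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSpan;

import java.io.IOException;
import java.util.List;

public class StandingsScraper {
    public static void main(String[] args) throws IOException {
        try (WebClient client = new WebClient()) {
            // The standings table is built by JavaScript, so it must stay enabled.
            client.getOptions().setJavaScriptEnabled(true);
            // The page throws many script errors; don't let them abort the load.
            client.getOptions().setThrowExceptionOnScriptError(false);
            client.getOptions().setCssEnabled(false);
            client.getOptions().setUseInsecureSSL(true);

            HtmlPage page = client.getPage("https://www.nba.com/standings#/");
            // Give the asynchronous scripts up to 10 seconds to finish.
            client.waitForBackgroundJavaScript(10_000);

            // Class name taken from the question; may be empty if the JS bug hits.
            List<HtmlSpan> headers =
                    page.getByXPath("//caption[@class='standings__header']/span");
            for (HtmlSpan span : headers) {
                System.out.println(span.getTextContent());
            }
        }
    }
}
```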
