
Jsoup returning &nbsp for active text fields

This seems simple, but I can't retrieve the text of a field on this web page, and the text appears to keep changing.

package WorldBoss;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;

public class WorldBoss {

    public static void main(String[] args) throws IOException {
        // Fetch the page; let an IOException propagate instead of continuing with a null page
        Document page = Jsoup.connect("http://wiki.guildwars2.com/wiki/World_boss").get();
        // All countdown fields on the page carry the class "timerjs"
        Elements allTimers = page.getElementsByClass("timerjs");
        String firstTime = allTimers.first().html();
        System.out.println(firstTime); // prints &nbsp; instead of the countdown text
    }
}

The text changes because it is a countdown.

In the browser's element properties, the innerHTML of the field shows the correct value.


Does anyone know how I can get this information with Jsoup?

The page is http://wiki.guildwars2.com/wiki/World_boss if you want to check it out.

As Pshemo mentioned in the comments, Jsoup is an HTML parser, so it neither renders the page nor executes the scripts on it. The countdown text is filled in by JavaScript after the page loads, which is why the HTML that Jsoup downloads still contains only the &nbsp; placeholder.

To extract the fields you want, I made slight modifications to your code so that it uses the PhantomJS driver through Selenium. The page is fetched and rendered by PhantomJS, and the resulting page source is handed to Jsoup for parsing. Find the code below:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

public class WorldBoss {

    public static void main(String[] args) {

        WebDriver driver = new PhantomJSDriver(new DesiredCapabilities());
        driver.get("http://wiki.guildwars2.com/wiki/World_boss"); // retrieve the page

        // Sleeping for a fixed time is bad practice; it is better to wait for a
        // specific element on the page, e.g. the element you are looking for [1]
        try { // wait so that the page is loaded and its JavaScript has run
            Thread.sleep(3 * 1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        String pageSource = driver.getPageSource();
        driver.quit(); // shut down the PhantomJS process once the source is captured

        Document page = Jsoup.parse(pageSource);
        Elements allTimers = page.getElementsByClass("timerjs");

        for (Element timer : allTimers) {
            // you can get whichever timer you want by its index
            String firstTime = timer.html().trim();
            if (firstTime.isEmpty()) continue;
            // use the timer for whatever you want
            System.out.println(firstTime);
        }
    }
}
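
The comment in the code above recommends waiting for a condition instead of sleeping for a fixed three seconds. Selenium provides WebDriverWait for that purpose [1]; the following is only a library-free sketch of the same polling idea, so that it runs without a browser. The class name PollingWait, the method waitUntil, and the timeout values are illustrative assumptions and are not part of Selenium or Jsoup.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class PollingWait {

    /**
     * Re-checks the condition every pollInterval until it holds or the timeout passes.
     * Returns true if the condition became true in time, false on timeout or interrupt.
     */
    public static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timeout reached without the condition holding
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition that becomes true after about 300 ms.
        // With Selenium, the condition would instead inspect the rendered page,
        // e.g. whether the first "timerjs" element has non-blank text (assumption).
        boolean ready = waitUntil(
                () -> System.currentTimeMillis() - start >= 300,
                Duration.ofSeconds(3),
                Duration.ofMillis(50));
        System.out.println(ready ? "condition met" : "timed out");
    }
}
```

Compared with a fixed sleep, this returns as soon as the condition holds and still gives up after a bounded time, so a slow page load fails visibly instead of silently yielding empty timers.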

I used Maven, so the dependencies in the pom.xml file are:

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>2.47.1</version>
    </dependency>
    <dependency>
        <groupId>com.github.detro.ghostdriver</groupId>
        <artifactId>phantomjsdriver</artifactId>
        <version>1.0.1</version>
    </dependency>

The code output is:

Active
00:01:33
00:01:33
00:16:33
00:31:33
00:46:33
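
If the goal is to do something with these values rather than print them, the strings can be turned into durations. The sketch below assumes the timers always appear as hours:minutes:seconds, as in the output above, and treats anything else (such as "Active" or an unrendered "&nbsp;" placeholder) as having no countdown. The class name TimerText and the method parse are hypothetical helpers, not part of Jsoup or Selenium.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimerText {

    // Matches countdown text such as "00:16:33" (hours:minutes:seconds).
    private static final Pattern HMS = Pattern.compile("(\\d{1,2}):(\\d{2}):(\\d{2})");

    /** Returns the countdown as a Duration, or empty for text that is not a countdown. */
    public static Optional<Duration> parse(String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = HMS.matcher(text.trim());
        if (!m.matches()) {
            return Optional.empty(); // e.g. "Active" or "&nbsp;"
        }
        long hours = Long.parseLong(m.group(1));
        long minutes = Long.parseLong(m.group(2));
        long seconds = Long.parseLong(m.group(3));
        return Optional.of(Duration.ofHours(hours).plusMinutes(minutes).plusSeconds(seconds));
    }

    public static void main(String[] args) {
        String[] samples = {"Active", "00:01:33", "00:16:33", "&nbsp;"};
        for (String sample : samples) {
            Optional<Duration> d = parse(sample);
            System.out.println(sample + " -> "
                    + d.map(x -> x.getSeconds() + " s").orElse("no countdown"));
        }
    }
}
```

In the loop over allTimers, each timer.html() value could be passed to parse and the empty results skipped.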

If you don't have PhantomJS installed on your machine, you need to install it for this to work. To install PhantomJS on a Debian-based system:

sudo apt-get install phantomjs

For other platforms (or to build from source), see how to install PhantomJS.

Hope this helps.

  [1] How to wait for elements in Selenium
