Jsoup returning &nbsp; for active text fields

This seems simple, but I can't retrieve the text on this web page, and the text appears to be changing.

package WorldBoss;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;

public class WorldBoss {

    public static void main(String[] args) throws IOException {
        // Fetch and parse the static HTML of the page.
        Document page = Jsoup.connect("http://wiki.guildwars2.com/wiki/World_boss").get();
        Elements allTimers = page.getElementsByClass("timerjs");
        // Print the raw markup inside the first timer element.
        String firstTime = allTimers.first().html();
        System.out.println(firstTime);
    }
}

The text changes because it is a countdown.

In the element's properties in the browser, the innerHTML shows the correct value.

Does anyone know how I can get this information with Jsoup?

The page is http://wiki.guildwars2.com/wiki/World_boss if you want to check it out.

As Pshemo mentioned in the comments, Jsoup is an HTML parser, so it neither renders the page nor executes the scripts on it. The countdown values are written into the page by JavaScript, which is why Jsoup only sees the &nbsp; placeholder.

To extract the fields you want, I made slight modifications to your code, using the PhantomJS driver through Selenium. The page is fetched and rendered by PhantomJS, and the resulting page source is passed to Jsoup for parsing. The code is below:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

public class WorldBoss {

    public static void main(String[] args) {
        WebDriver driver = new PhantomJSDriver(new DesiredCapabilities());
        String pageSource;
        try {
            driver.get("http://wiki.guildwars2.com/wiki/World_boss"); // retrieve the page

            // A fixed sleep is bad practice; it is better to wait for a specific
            // element on the page, e.g. the element you are looking for [1].
            try { // wait so that the page loads and its JavaScript runs
                Thread.sleep(3 * 1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }

            pageSource = driver.getPageSource();
        } finally {
            driver.quit(); // shut down the PhantomJS process
        }

        Document page = Jsoup.parse(pageSource);
        Elements allTimers = page.getElementsByClass("timerjs");

        for (Element timer : allTimers) {
            // you can get whichever timer you want by its index
            String firstTime = timer.html().trim();
            if (firstTime.isEmpty()) continue;
            // use the timer value for whatever you want
            System.out.println(firstTime);
        }
    }
}
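
The comment in the code above recommends waiting for a condition instead of sleeping for a fixed time. Selenium's own WebDriverWait is the usual tool for that; purely to illustrate the idea without any dependencies, here is a rough polling sketch. The `PollingWait` class and `waitUntil` helper are hypothetical names invented for this example and are not part of Selenium or Jsoup, and the stand-in condition in `main` is an assumption that merely simulates a page becoming ready.

```java
import java.util.function.BooleanSupplier;

public class PollingWait {

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 300 ms after start.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 3000, 50);
        System.out.println("ready = " + ready);
    }
}
```

With the driver from the answer, the condition could be something like `() -> !driver.findElements(By.className("timerjs")).isEmpty()`, although a check on the timer text is probably needed as well, since the placeholder elements may already exist before the script fills them in.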

I used Maven, so the dependencies in the pom file are:

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>2.47.1</version>
    </dependency>
    <dependency>
        <groupId>com.github.detro.ghostdriver</groupId>
        <artifactId>phantomjsdriver</artifactId>
        <version>1.0.1</version>
    </dependency>

The code output is:

Active
00:01:33
00:01:33
00:16:33
00:31:33
00:46:33
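
The printed values are plain strings. If you need them as durations, a minimal sketch along the following lines might work, assuming the timers always render either as HH:MM:SS or as a label such as Active. The `TimerText` class and `parseCountdown` method are hypothetical names for illustration only; they also treat a leftover &nbsp; placeholder as "no countdown".

```java
import java.time.Duration;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimerText {

    // Assumed format: hours, minutes and seconds separated by colons.
    private static final Pattern HMS = Pattern.compile("(\\d{1,3}):(\\d{2}):(\\d{2})");

    /** Returns the countdown as a Duration, or empty for labels and placeholders. */
    public static Optional<Duration> parseCountdown(String text) {
        if (text == null) {
            return Optional.empty();
        }
        // Treat the HTML entity and the non-breaking space character as blanks.
        String cleaned = text.replace("&nbsp;", " ").replace('\u00A0', ' ').trim();
        Matcher m = HMS.matcher(cleaned);
        if (!m.matches()) {
            return Optional.empty();
        }
        long hours = Long.parseLong(m.group(1));
        long minutes = Long.parseLong(m.group(2));
        long seconds = Long.parseLong(m.group(3));
        return Optional.of(Duration.ofHours(hours).plusMinutes(minutes).plusSeconds(seconds));
    }

    public static void main(String[] args) {
        String[] samples = {"Active", "00:01:33", "00:16:33", "&nbsp;"};
        for (String s : samples) {
            Optional<Duration> d = parseCountdown(s);
            System.out.println(s + " -> "
                    + (d.isPresent() ? d.get().getSeconds() + " s" : "no countdown"));
        }
    }
}
```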

If you don't have PhantomJS installed on your machine, you need to install it for this to work. To install it on a Debian-based box:

sudo apt-get install phantomjs

For other platforms (or to build from source), see how to install PhantomJS.

Hope this helps.

  1. How to wait for elements in Selenium
