Java HTML parser for reading JavaScript-generated content

I am using jsoup to read a web page with the following function.

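import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Fetches and parses the HTML returned by the server for the given URL.
// Returns null if the request or the parsing fails.
public Document getDocument(String url) {
    try {
        // 20-second timeout; jsoup parses the response but does not run any scripts in it
        return Jsoup.connect(url).timeout(20 * 1000).userAgent("Mozilla").get();
    } catch (Exception e) {
        return null;
    }
}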

But whenever I try to read a web page that contains JavaScript-generated content, jsoup does not read that content. That is, the actual content of the page is loaded by JavaScript calls, so it is not present in the page source of that link. For example, this blog: http://blog.rapporter.net/search/label/r. Is there a way to also get the JavaScript-generated content when parsing a page with jsoup? If not, please suggest a Java HTML parser that can solve this problem.

You cannot do this with jsoup. jsoup only parses the HTML it receives. To wait for AJAX requests, or for JavaScript-generated content in general, you need a browser (or a browser engine) that can execute the JavaScript and produce its output. JavaScript logic can be complex, so executing it and loading the resulting content is not trivial (just look at how complicated browsers, JavaScript and the DOM are).
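To make the point above concrete, here is a minimal sketch using only the Java standard library. It assumes a made-up server response (SAMPLE_HTML) in which the text exists only inside a script; the class and method names are hypothetical and are not part of jsoup. The regular expression is only a rough stand-in for a parser that does not run scripts, so treat it as a diagnostic, not as a way to parse HTML.

```java
import java.util.regex.Pattern;

final class StaticHtmlCheck {
    // Hypothetical server response: the container is empty and a script fills it in later.
    static final String SAMPLE_HTML =
        "<html><body>"
        + "<div id=\"posts\"></div>"
        + "<script>document.getElementById('posts').textContent = 'Hello from JS';</script>"
        + "</body></html>";

    // Matches <script ...> ... </script> blocks, case-insensitively and across line breaks.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
        "<script\\b[^>]*>.*?</script\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the markup with script blocks removed, which is roughly
    // what a parser that never executes scripts can offer as page content.
    static String withoutScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll("");
    }

    // True if the text is present in the static markup itself.
    static boolean inStaticMarkup(String html, String text) {
        return withoutScripts(html).contains(text);
    }

    // True if the text occurs only inside script blocks, which suggests
    // it is generated at runtime and needs a JavaScript engine to appear.
    static boolean onlyInScripts(String html, String text) {
        return html.contains(text) && !inStaticMarkup(html, text);
    }

    public static void main(String[] args) {
        System.out.println(inStaticMarkup(SAMPLE_HTML, "Hello from JS")); // false
        System.out.println(onlyInScripts(SAMPLE_HTML, "Hello from JS"));  // true
    }
}
```

If a check like this shows that the text you want is absent from the static markup, no amount of selector tuning in an HTML parser will find it; the page has to be rendered by something that executes its JavaScript.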
