Jsoup.body() returns empty body

I am trying to get the body content of an HTML page, but Jsoup returns an empty body for this one specific site. What could be the cause?

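import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Fetch the page with a browser-like user agent; timeout(0) means no timeout.
Document doc = Jsoup
        .connect("http://givatram.ort.org.il/%D7%9C%D7%95%D7%97-%D7%A9%D7%99%D7%A0%D7%95%D7%99%D7%99-%D7%9E%D7%A2%D7%A8%D7%9B%D7%AA/")
        .userAgent(
                "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36")
        .timeout(0).followRedirects(true).execute().parse();
Elements titles = doc.select(".entrytitle");

// For this site, the printed body is empty.
System.out.println(doc.body());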

I could reproduce your problem. If I print the entire document with System.out.println(doc), I can see a script in the head tag, which shows that Jsoup does connect to the site. According to this answer, Jsoup is a pure HTML parser and does not execute JavaScript. The content of the site may be loaded via JavaScript, which would explain why the body is empty.
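To illustrate the idea without depending on the site, here is a minimal sketch that uses only the Java standard library, not Jsoup. It checks raw markup for the pattern described above: an empty body together with at least one script tag. The class and method names (StaticHtmlCheck, bodyOf, looksScriptRendered) and the sample markup are made up for this example, and the regex check is a rough heuristic, not a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticHtmlCheck {
    // Captures everything between <body ...> and </body>, ignoring case, across lines.
    private static final Pattern BODY = Pattern.compile(
            "<body\\b[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches the start of any script tag.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b", Pattern.CASE_INSENSITIVE);

    /** Returns the trimmed body markup, or an empty string if no body element is found. */
    static String bodyOf(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** True when the body is empty but the page ships a script: a hint of client-side rendering. */
    static boolean looksScriptRendered(String html) {
        return bodyOf(html).isEmpty() && SCRIPT.matcher(html).find();
    }

    public static void main(String[] args) {
        // Hypothetical markup as a server might send it, before any script has run.
        String served = "<html><head><script>document.write('<p>Hello</p>');</script></head>"
                + "<body></body></html>";
        System.out.println(looksScriptRendered(served)); // prints true
    }
}
```

A parser that does not run scripts only ever sees the markup as served, so whatever the script would have added is missing from the parsed body.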

Edit 1:

I could verify this. If I use ui4j, a small wrapper for the JavaFX browser, I can see the body:

BrowserEngine browser = BrowserFactory.getWebKit();
Page page = browser.navigate("http://givatram.ort.org.il/%D7%9C%D7%95%D7%97-%D7%A9%D7%99%D7%A0%D7%95%D7%99%D7%99-%D7%9E%D7%A2%D7%A8%D7%9B%D7%AA/");
System.out.println(page.getDocument().getBody());

So unfortunately it seems that what you are trying to do is not possible with Jsoup.
