Jsoup fails to get outer HTML with nested tags

After connecting to an Instagram page with Jsoup, I want to extract the whole outer HTML of a tag. When I inspect the page in the browser and copy the tag's outer HTML, I get many lines, but with Jsoup I get only a few (the HTML of the nested tags seems to be ignored). Any help on how to get the whole HTML would be appreciated!

Code:

Document doc = Jsoup.connect("https://www.instagram.com/myUsername").get();

Element link = doc.selectFirst("span");
String linkOuter = link.outerHtml();
System.out.println(linkOuter);

Output:

<span id="react-root">
  <svg width="50" height="50" viewbox="0 0 50 50" 
  style="position:absolute;top:50%;left:50%;margin:-25px 0 0 
  -25px;fill:#c7c7c7">
    <path d="M25 1c-6.52 0-7.34.03-9.9.14-2.55.12-4.3.53-5.82..." />
</svg></span>

Image of the structure: (image not included)

EDIT: I want the whole HTML of the span tag to be saved. I want the same result with HtmlUnit/Jsoup as when I right-click the tag, click "Edit as HTML", and then right-click -> "Copy outer HTML".

Unfortunately, Instagram is a web app built with the JavaScript framework React. That means the final HTML is not returned from the server; it is generated by JavaScript on the client side, in the browser, after the initial page load.
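One quick way to confirm this diagnosis is to check whether text you can see in the browser occurs at all in the raw server response. The sketch below uses only the standard library. The class name, the sample markup (modeled loosely on the output in the question), and the marker text "Followers" are all assumptions made for illustration, not values taken from Instagram.

```java
public class ServerHtmlCheck {

    // Hypothetical sample modeled on the output in the question: the markup
    // the server sends before any JavaScript has run.
    static final String SERVER_HTML =
            "<html><body><span id=\"react-root\">"
            + "<svg width=\"50\" height=\"50\"></svg>"
            + "</span></body></html>";

    // True if the marker text occurs in the raw, unrendered server response.
    static boolean presentInServerHtml(String serverHtml, String marker) {
        return serverHtml.contains(marker);
    }

    public static void main(String[] args) {
        // The placeholder element is part of the static response...
        System.out.println(presentInServerHtml(SERVER_HTML, "react-root"));
        // ...but text that only exists after React renders is not.
        // "Followers" is an assumed example of such text.
        System.out.println(presentInServerHtml(SERVER_HTML, "Followers"));
    }
}
```

If a marker that is visible in the browser is missing from the raw response, the content is being generated client-side, and no HTML parser alone can recover it.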

To see the HTML generated by React, you need to evaluate the JavaScript code returned from the server. Jsoup is a simple HTML parser and can't evaluate JavaScript, so you have to use another library, for example HtmlUnit.

For example:

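// HtmlUnit 2.x imports; in HtmlUnit 3.x and later the package root is org.htmlunit
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.net.URL;

WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true); // enable JavaScript
webClient.getOptions().setThrowExceptionOnScriptError(false); // continue even if a script throws an error
HtmlPage page = webClient.getPage(new URL("https://www.instagram.com/myUsername"));
webClient.waitForBackgroundJavaScript(5000); // important: wait up to 5 s for background JavaScript to finish rendering
DomElement root = page.getElementById("react-root");
System.out.println(root.asXml()); // XML serialization of the element and its nested tags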
