
Jsoup: get dynamically generated HTML

Original post: 2014-03-13 20:57:13 | Tags: java, javascript, parsing, jsoup

I can connect to most sites and get the HTML just fine, but when I connect to a website where most of the content is generated by JavaScript after the initial page load, I do not get any of that data. Is there any way to do this with Jsoup, or does it not support it?

1 Answer

Jsoup includes some basic connection handling, but it is not a web browser. It excels at parsing static HTML content. It does not run any JavaScript, so it cannot see content that scripts add after the page loads. However, there are other options you can pursue:

You can analyze the page you want to retrieve and find out how the content you are interested in gets loaded. Often it is not very hard to tap the original source of the loaded content and work with that directly. This approach has the benefit that you get what you want without extra libraries, and retrieval will be fast.
You can use a (full) browser and automate the loading of the page. A very good tool for this is Selenium WebDriver in combination with the headless WebKit browser PhantomJS. This, however, requires extra software and extra libraries in your project and will run much slower than the first approach.
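
As a rough illustration of the first option, here is a minimal sketch. It assumes the browser's network tab shows the page loading its content from a JSON endpoint. The class name `DirectDataFetch`, the example URL, and the choice of request headers are all hypothetical. The sketch uses the JDK's own `java.net.http.HttpClient` (Java 11+) so that it runs without extra dependencies.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper: requests the data source that the page's JavaScript
// would normally call, instead of the page itself.
final class DirectDataFetch {

    // Builds a GET request that resembles a typical AJAX call.
    // Whether a given site needs these headers is an assumption;
    // copy the real ones from the browser's network tab.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // e.g. java DirectDataFetch.java https://example.com/api/items?page=1
            System.out.println("usage: DirectDataFetch <data-url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

If the endpoint returns an HTML fragment rather than JSON, the returned string can be passed to `Jsoup.parse(String)` and queried as usual. Jsoup can also make the request itself with `Jsoup.connect(url).ignoreContentType(true).execute().body()`, which avoids its default rejection of non-HTML content types.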