Get the full HTML content of a web page (including JavaScript-generated content)
After some hours of trying and reading, I'm a bit lost on the subject in the title.
My problem: I am trying to get the full HTML content of a single web page, including the HTML that JavaScript appends or adds after loading. What I have already tried:
So now the question is: how can I imitate the "Save as" function of a browser, or more generally, how can I get the full HTML content first AND then use Jsoup to scan the final, static HTML?
Thanks a lot for your advice and your help!
I finally got what I wanted. I will try to explain it for those who need some help!
The process consists of two steps:
1 - Get the HTML content and save it
For this step, you will need to download PhantomJS and use it to get the content. Here is the code to get the target page. Just replace myTargetedPage.com with the URL of the page you want to get, and mySaveFile.html with the name of the output file.
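var page = require('webpage').create();
var fs = require('fs');

page.open('http://myTargetedPage.com', function (status) {
    // Stop early if the page could not be loaded
    if (status !== 'success') {
        console.log('Unable to load the page');
        phantom.exit(1);
        return;
    }
    // page.content is the current DOM serialized to HTML,
    // including the nodes that scripts have added so far
    fs.write('mySaveFile.html', page.content, 'w');
    phantom.exit();
});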
As you can see, the saved file matches the content loaded in your browser.
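Since step 2 is written in Java, the script can also be launched from Java so that both steps run in one go. This is only a minimal sketch under a few assumptions that are not part of the original answer: the `phantomjs` binary is on the PATH, the script above is saved as `save-page.js`, and the class and method names (`RunPhantom`, `buildCommand`, `run`) are hypothetical.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class RunPhantom {
    // Builds the command line "phantomjs <script>".
    // Assumption: the phantomjs executable can be found on the PATH.
    public static List<String> buildCommand(String scriptPath) {
        return Arrays.asList("phantomjs", scriptPath);
    }

    // Starts PhantomJS, waits for it to finish, and returns its exit code.
    public static int run(String scriptPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(scriptPath))
                .inheritIO() // forward PhantomJS console output to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            // "save-page.js" is a hypothetical file name for the script above
            int exitCode = run("save-page.js");
            System.out.println("phantomjs exited with code " + exitCode);
        } catch (IOException e) {
            // Typically means the phantomjs binary was not found
            System.out.println("Could not start phantomjs: " + e.getMessage());
        }
    }
}
```

Once the process has finished, the saved HTML file should be on disk and ready for the Jsoup step.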
2 - Extract the content you want
Now we will use Java and the Jsoup library to get our specific content. In my example, I want to get this part of the web page:
/* HTML CONTENT */
<span class="my class" data="data1"></span>
/* HTML CONTENT */
<span class="my class" data="data2"></span>
/* HTML CONTENT */
To get this, the following code will do (don't forget to edit thePathToYourSavedFile.html):
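import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractData {
    public static void main(String[] args) throws Exception {
        // Path to the file written by PhantomJS in step 1
        File input = new File("thePathToYourSavedFile.html");
        // Jsoup.connect() only accepts http(s) URLs, so parse the local file instead
        Document document = Jsoup.parse(input, "UTF-8");
        Elements spanList = document.select("span");
        for (Element span : spanList) {
            // Keep only the spans whose class attribute is exactly "my class"
            if (span.attr("class").equals("my class")) {
                String data = span.attr("data");
                System.out.println("data : " + data);
            }
        }
    }
}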
Enjoy!
There is a nice plugin that gives you what you are looking for. It offers a way to see a page and its functionality. It is available for some browsers, but not all. Here is the link: http://chrispederick.com/work/web-developer/
PS: after you install it, there is a little gear on the toolbar at the top right. That is where all the functions are.