Get the full HTML content of a web page (including JavaScript-generated content)

After several hours of trying and reading, I'm a bit lost about the subject in the title.

My problem: I am trying to get the full HTML content of a single web page, including the HTML that JavaScript appends or adds.

What I have already tried:

  • I used Jsoup, but I had to switch because Jsoup doesn't handle JavaScript content.
  • I used HtmlUnit, but I get many errors while loading the target page (CSS errors, runtimeError, EcmaError, etc.).
  • I used Chrome's basic "save as" feature to save the complete web page, and then used the Jsoup library to extract the content I was looking for. This is the only way I have managed to get the content I want.

So the question is: how can I imitate a browser's "save as" function, or more generally, how can I first get the full HTML content and then use Jsoup to scan the final, static HTML?

Thanks a lot for your advice and your help!

I finally got what I wanted. I will try to explain it for those who need some help!


So! The process consists of two steps:

  • First, get the final HTML content (including the HTML added by JavaScript, etc.) as if you were visiting the web page, and save it to a simple file, file.html.
  • Then, use the Jsoup library to extract the wanted content from the saved file, file.html.

1 - Get the HTML content and save it

For this step, you will need to download PhantomJS and use it to get the content. Here is the code to get the target page. Just replace myTargetedPage.com with the URL of the page you want to get, and mySaveFile.html with the name of the file to save to.

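var page = require('webpage').create();
var fs = require('fs');

page.open('http://myTargetedPage.com', function (status) {
    // Only save the page if it loaded successfully.
    if (status === 'success') {
        // page.content is the page's HTML as it stands now,
        // including markup added by scripts that have already run.
        fs.write('mySaveFile.html', page.content, 'w');
    }
    phantom.exit();
});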

As you can see, the saved file contains the same content as the page loaded in your browser, including the HTML added by JavaScript.
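If you would rather start PhantomJS from Java than from a terminal, here is a minimal sketch using only the standard library. It assumes the phantomjs executable is on your PATH and that the script above has been saved as save_page.js (a hypothetical file name, as is the PhantomRunner class); adjust both to your setup.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

class PhantomRunner {

    // Builds the command line "phantomjs <script>".
    // Assumption: the phantomjs executable can be found on the PATH.
    static List<String> buildCommand(String scriptPath) {
        return List.of("phantomjs", scriptPath);
    }

    // Starts PhantomJS, waits for it to finish and returns its exit code.
    static int run(String scriptPath, File workingDir)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(scriptPath));
        builder.directory(workingDir); // working directory of the PhantomJS process
        builder.inheritIO();           // forward PhantomJS output to this console
        return builder.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // "save_page.js" is a hypothetical name for the script shown above.
        String script = args.length > 0 ? args[0] : "save_page.js";
        try {
            int exitCode = run(script, new File("."));
            System.out.println("PhantomJS exited with code " + exitCode);
        } catch (IOException e) {
            // Thrown when the executable cannot be started (e.g. not installed).
            System.err.println("Could not start phantomjs: " + e.getMessage());
        }
    }
}
```

Once the process has finished, the saved HTML file can be handed to Jsoup as described in step 2.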

2 - Extract the content you want

Now, we will use Java and the Jsoup library to get our specific content. In my example, I want to get this part of the web page:

/* HTML CONTENT */
<span class="my class" data="data1"></span>
/* HTML CONTENT */
<span class="my class" data="data2"></span>
/* HTML CONTENT */

To get this, the following code will do (don't forget to edit thePathToYourSavedFile.html):

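import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractData {
    public static void main(String[] args) throws Exception {
        // Path to the file saved by PhantomJS in step 1.
        File input = new File("thePathToYourSavedFile.html");

        // Jsoup.connect() only handles http(s) URLs; a local file is read with Jsoup.parse().
        Document document = Jsoup.parse(input, "UTF-8");

        Elements spanList = document.select("span");

        for (Element span : spanList) {
            // Keep only the spans whose class attribute is exactly "my class".
            if (span.attr("class").equals("my class")) {
                String data = span.attr("data");
                System.out.println("data : " + data);
            }
        }
    }
}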

Enjoy!

There is a nice plugin that gives you what you are looking for. It offers a way to inspect a page and its functionality. It is available for some browsers, but not all. Here is the link: http://chrispederick.com/work/web-developer/

PS: After you install it, there is a little gear icon on the toolbar at the top right. That is where all the functions are.