
Retrieving full page source from web url

I have a minor project that I am working on where I am scraping information from webpages. 我有一个小型项目,我正在研究从网页上抓取信息的地方。 As a beginning step I began looking at the page source from 首先,我开始从

http://www.walmart.com/search/search-ng.do?search_query=camera&ic=16_0&Find=Find&search_constraint=0

After analyzing what I needed to do, I attempted to retrieve that same page information using two methods, both of which were unsuccessful.

First I tried a simple request using Jsoup, which looks like the following:

    // Requires the Jsoup imports org.jsoup.Jsoup and org.jsoup.nodes.Document, plus java.io.IOException.
    Document doc;
    try {
        // Fetch the search page and parse it into a DOM in a single call.
        doc = Jsoup.connect("http://www.walmart.com/search/search-ng.do?search_query=camera&ic=16_0&Find=Find&search_constraint=0").get();

        System.out.println(doc);

    } catch (IOException e) {
        e.printStackTrace();
    }

This brought up some page information, but not the actual page source that includes all of the search results.

Then I tried an Apache Commons HTTP solution, which looks like:

    // Requires Apache HttpClient 4.x imports: DefaultHttpClient, HttpPost, HttpResponse,
    // StatusLine, EntityUtils, ClientProtocolException, plus java.io.IOException.
    String url = "http://www.walmart.com/search/search-ng.do?search_query=camera&ic=16_0&Find=Find&search_constraint=0";
    DefaultHttpClient httpclient = new DefaultHttpClient();
    HttpPost request = new HttpPost(url);

    HttpResponse response;
    try {
        response = httpclient.execute(request);
        // The status line is where the "moved permanently" response shows up.
        StatusLine status = response.getStatusLine();
        String responseString = EntityUtils.toString(response.getEntity());

        System.out.println(status);
        System.out.println(responseString);

    } catch (ClientProtocolException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

but I keep receiving a "page permanently moved" status.

So far it seems that Jsoup is my best option for moving forward. I believe the issue of not receiving all of the search results has to do with the scripts on the page not running when the page is fetched by Jsoup's get function.

How would I get all of the page information so that I can begin retrieving information from the search results?

Jsoup does not support execution of JavaScript, meaning that you won't be able to parse dynamically generated HTML. Simply put, Jsoup does not simulate a browser environment; it is a pure parser.

I would suggest that you instead use HtmlUnit, which is a "GUI-less browser for Java programs". It has support for JavaScript execution and can be used to generate the HTML source, which you can then parse more easily with Jsoup.

HtmlUnit can be found here.
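For illustration, here is a minimal sketch of that combination, assuming a recent HtmlUnit 2.x artifact (package com.gargoylesoftware.htmlunit) and Jsoup are on the classpath; the class name is arbitrary, the URL is the one from the question, and the five-second wait for background JavaScript is just an example value:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class SearchPageSource {

        public static void main(String[] args) throws Exception {
            String url = "http://www.walmart.com/search/search-ng.do"
                    + "?search_query=camera&ic=16_0&Find=Find&search_constraint=0";

            // WebClient is HtmlUnit's GUI-less browser; try-with-resources closes it.
            try (WebClient webClient = new WebClient()) {
                // The search results are generated by scripts, so JavaScript must run.
                webClient.getOptions().setJavaScriptEnabled(true);
                // Third-party scripts often throw errors; do not abort on them.
                webClient.getOptions().setThrowExceptionOnScriptError(false);
                webClient.getOptions().setCssEnabled(false);

                HtmlPage page = webClient.getPage(url);
                // Give background AJAX requests some time to finish (milliseconds).
                webClient.waitForBackgroundJavaScript(5000);

                // Hand the rendered markup to Jsoup for the actual extraction.
                Document doc = Jsoup.parse(page.asXml());
                System.out.println(doc.title());
            }
        }
    }

Once the rendered HTML is in a Jsoup Document, the usual selector methods (doc.select and friends) can be used to pull the individual search results out of it.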
