简体   繁体   中英

Screen scraping web page after delay

I'm trying to scrape a web page using C#, however after the page loads, it executes some JavaScript which loads more elements into the DOM which I need to scrape. A standard scraper simply grabs the html of the page on load and doesn't pick up the DOM changes made via JavaScript. How do I put in some sort of functionality to wait for a second or two and then grab the source?

Here is my current code:

private string ScrapeWebpage(string url, DateTime? updateDate)
{
    HttpWebRequest request = null;
    HttpWebResponse response = null;
    Stream responseStream = null;
    StreamReader reader = null;
    string html = null;
    try
    {
        //create request (which supports http compression)
        request = (HttpWebRequest)WebRequest.Create(url);
        request.Pipelined = true;
        request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
        if (updateDate != null)
            request.IfModifiedSince = updateDate.Value;
        //get response.
        response = (HttpWebResponse)request.GetResponse();
        responseStream = response.GetResponseStream();
        if (response.ContentEncoding.ToLower().Contains("gzip"))
            responseStream = new GZipStream(responseStream,
                CompressionMode.Decompress);
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
            responseStream = new DeflateStream(responseStream,
                CompressionMode.Decompress);
        //read html.
        reader = new StreamReader(responseStream, Encoding.Default);
        html = reader.ReadToEnd();
    }
    catch
    {
        throw;
    }
    finally
    {
        //dispose of objects.
        request = null;
        if (response != null)
        {
            response.Close();
            response = null;
        }
        if (responseStream != null)
        {
            responseStream.Close();
            responseStream.Dispose();
        }
        if (reader != null)
        {
            reader.Close();
            reader.Dispose();
        }
    }
    return html;
}

Here's a sample URL:

http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4

You'll see when the page first loads it says 134 listings found, then after a second it says 187 properties found.

To execute the JavaScript I use webkit to render the page, which is the engine used by Chrome and Safari. Here is an example using its Python bindings.

Webkit also has .NET bindings but I haven't used them.

The approach you have will not work regardless how long you wait, you need a browser to execute the javascript (or something that understands javascript).

Try this question: What's a good tool to screen-scrape with Javascript support?

You would need to execute the javascript yourself to get this functionality. Currently, your code only receives whatever the server replies with at the URL you request. The rest of the listings are "showing up" because the browser downloads, parses, and executes the accompanying javascript.

The answer to this similar question says to use a web browser control to read the page in and process it before scraping it. Perhaps with some kind of timer delay to give the javascript some time to execute and return results.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM