
C# WebBrowser: How can I wait for JavaScript that runs on document load to finish?

I'm working on a project that involves scraping some product data off of a vendor's web site (with their blessing, but not their help). I'm working in a C# shop, so I'm using the .NET Windows Forms WebBrowser control.

I'm responding to the DocumentCompleted event, but I'm finding that I have to call Thread.Sleep for a little while, or else the data doesn't show up where I expect it to in the DOM.

Looking at the JavaScript on the page, I can see that it dynamically alters existing DOM content (setting someDomElement.innerHTML) after the page finishes loading. It's not making any AJAX calls; it's using data it already has from the original page load. (I could try to parse that data directly, but it is embedded in the JavaScript and a bit obfuscated.) So evidently I'm getting the DocumentCompleted event before the JavaScript has finished running.

There could eventually be a lot of pages to scrape, so waiting around for a half second or whatever is really far less than ideal. I would like to only wait until all the JavaScript that starts on document ready / page load has finished running before I examine the page. Does anyone know of a way to do that?

I suppose the DocumentCompleted event shouldn't fire until then, right? But it definitely appears to. Maybe somewhere the page's JavaScript is using a setTimeout. Is there a way to tell whether any timeouts are pending?

Thanks for any help!

You could:

  1. Assuming the format of the data never changes, look at how the JavaScript processes the data and do the same on your end, so you can retrieve the data immediately at page load
  2. Inject JavaScript into the web page and detect DOM modifications, so that C# knows when to fetch the data
  3. Write a pure JavaScript solution with PhantomJS
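Option 2 could be sketched roughly as follows. This is only an illustration under several assumptions: `DomReadyBridge`, `DomWatcher`, and their members are hypothetical names, the injected script checks for a marker string on a timer rather than using mutation events (whose availability depends on the IE emulation mode the control runs in), and `Attach` is assumed to be called on the UI thread from the DocumentCompleted handler.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Windows.Forms;

// Hypothetical bridge object. The page script reaches it through window.external,
// so the class must be public and COM-visible.
[ComVisible(true)]
public class DomReadyBridge
{
    public event Action<string> Ready;

    // Called by the injected JavaScript once the watched element contains the marker text.
    public void NotifyReady(string elementId)
    {
        Action<string> handler = Ready;
        if (handler != null) handler(elementId);
    }
}

public static class DomWatcher
{
    // Minimal escaping for a single-quoted JavaScript string literal
    // (assumption: the inputs contain no line breaks).
    private static string Escape(string s)
    {
        return s.Replace("\\", "\\\\").Replace("'", "\\'");
    }

    // Builds a script that checks the element every 50 ms and calls back into C#
    // the first time its innerHTML contains the marker text.
    public static string BuildScript(string elementId, string markerText)
    {
        string id = Escape(elementId);
        string text = Escape(markerText);
        return
            "(function () {" +
            "  var timer = setInterval(function () {" +
            "    var el = document.getElementById('" + id + "');" +
            "    if (el && el.innerHTML.indexOf('" + text + "') !== -1) {" +
            "      clearInterval(timer);" +
            "      window.external.NotifyReady('" + id + "');" +
            "    }" +
            "  }, 50);" +
            "})();";
    }

    // Injects the script into the loaded document. Call on the UI thread,
    // for example from the DocumentCompleted handler.
    public static void Attach(WebBrowser browser, DomReadyBridge bridge,
                              string elementId, string markerText)
    {
        browser.ObjectForScripting = bridge;
        HtmlElement head = browser.Document.GetElementsByTagName("head")[0];
        HtmlElement script = browser.Document.CreateElement("script");
        script.SetAttribute("text", BuildScript(elementId, markerText));
        head.AppendChild(script);
    }
}
```

With this shape, the C# side subscribes to `bridge.Ready` and reads the DOM from that handler, which runs on the UI thread, so no fixed sleep is needed.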

For posterity, and for anyone else looking at this: what I ended up doing was writing a function that waits up to a specified timeout for a particular element (one matching a given set of criteria) to show up on the page, and then returns its HtmlElement. It checks the browser DOM periodically until the element appears. It is intended to be called by a scraper worker running in a background thread, so it uses an Invoke to access the browser DOM on each check.

    /// <summary>
    /// Waits for a tag that matches a given criteria to show up on the page.
    /// 
    /// Note: This function returns a browser DOM element from the foreground thread, and this scraper is running in a background thread,
    /// so use an invoke [ scraperForm.Browser.Invoke(new Action(()=>{ ... })); ] when doing anything with the returned DOM element.
    /// </summary>
    /// <param name="tagName">The type of tag, or "" if all tags are to be searched.</param>
    /// <param name="id">The id of the tag, or "" if the search is not to be by id.</param>
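    /// <param name="className">The class name of the tag, or "" if the search is not to be by class name.</param>
    /// <param name="keyContent">A string to search the tag's innerText for, or "" to skip the content check.</param>
    /// <param name="timeout">How long to wait, in milliseconds, before giving up.</param>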
    /// <returns>The first tag to match the criteria, or null if such a tag was not found after the timeout period.</returns>
    public HtmlElement WaitForTag(string tagName, string id, string className, string keyContent, int timeout) {
        Log(string.Format("WaitForTag('{0}','{1}','{2}','{3}',{4}) --", tagName, id, className, keyContent, timeout));
        HtmlElement result = null;
        int timeleft = timeout;
        while (timeleft > 0) {
            //Log("time left: " + timeleft);
            // Access the DOM in the foreground thread using an Invoke call.
            // (required by the WebBrowser control, otherwise cryptic errors result, like "invalid cast")
            scraperForm.Browser.Invoke(new Action(() => {
                HtmlDocument doc = scraperForm.CurrentDocument;
                if (id == "") {
                    //Log("no id supplied..");
                    // no id was supplied, so get tags by tag name if a tag name was supplied, or get all the tags
                    HtmlElementCollection elements = (tagName == "") ? doc.All : doc.GetElementsByTagName(tagName);
                    //Log("found " + elements.Count + " '" + tagName + "' tags");
                    // find the tag that matches the class name (if given) and contains the given content (if any)
                    foreach (HtmlElement element in elements) {
                        if (element == null) continue;
                        if (className != "" && !TagHasClass(element, className)) {
                            //Log(string.Format("looking for className {0}, found {1}", className, element.GetAttribute("className")));
                            continue;
                        }
                        if (keyContent == "" || 
                            (element.InnerText != null && element.InnerText.Contains(keyContent)) ||
                            (tagName == "input" && element.GetAttribute("value").Contains(keyContent)) ||
                            (tagName == "img" && element.GetAttribute("src").Contains(keyContent)) || 
                            (element.OuterHtml.Contains(keyContent)))
                        {
                            result = element;
                            break;   // stop at the first match, as the summary promises
                        }
                        else if (keyContent != "") {
                            //Log(string.Format("searching for key content '{0}' - found '{1}'", keyContent, element.InnerText));
                        }
                    }
                }
                else {
                    //Log(string.Format("searching for tag by id '{0}'", id));
                    // an id was supplied, so get the tag by id 
                    // Log("looking for element with id [" + id + "]");
                    HtmlElement element = doc.GetElementById(id);
                    // make sure it matches any given class name and contains any given content
                    if (
                        element != null 
                        && 
                        (className == "" || TagHasClass(element, className))
                        && 
                        (keyContent == "" || 
                            (element.InnerText != null && element.InnerText.Contains(keyContent))
                        )
                    ) {
                        // Log("  found it");
                        result = element;
                    }
                    else {
                        // Log("  didn't find it");
                    }
                }
            }));
            if (result != null) break;   // the searched for tag appeared, break out of the loop 
            Thread.Sleep(200);           // wait 200 ms, then check again
            // Note: Make sure sleeps like this are outside of invokes to the foreground thread, so they only pause this background thread.
            timeleft -= 200;
        }
        return result;
    }
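The wait-and-retry loop above can also be factored into a small reusable helper. The following is a minimal sketch, not part of the original answer: `Poller` and `WaitFor` are hypothetical names, and it measures elapsed time with a Stopwatch so that time spent inside each check (for example, inside an Invoke) counts against the timeout.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Poller
{
    // Calls probe() repeatedly until it returns a non-null value or timeoutMs elapses.
    // Returns the value found, or null on timeout. Intended for a background thread,
    // because it blocks the calling thread while waiting.
    public static T WaitFor<T>(Func<T> probe, int timeoutMs, int intervalMs) where T : class
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (true)
        {
            T result = probe();
            if (result != null) return result;

            long remaining = timeoutMs - clock.ElapsedMilliseconds;
            if (remaining <= 0) return null;

            // Never sleep past the deadline.
            Thread.Sleep((int)Math.Min(intervalMs, remaining));
        }
    }
}

// Hypothetical usage from the scraper thread, with "price" as a made-up element id:
//
//   HtmlElement el = Poller.WaitFor(
//       () => (HtmlElement)browser.Invoke(
//           new Func<HtmlElement>(() => browser.Document.GetElementById("price"))),
//       5000, 200);
```

As with WaitForTag, the sleep happens outside the Invoke, so only the background thread pauses between checks.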


 