C# WebBrowser: how to wait for JavaScript to finish running after the document has loaded?

Question

I'm working on a project that involves scraping some product data off a vendor's web site (with their blessing, but not their help). I'm working in a C# shop, so I'm using the .NET Windows Forms WebBrowser control.

I'm handling the DocumentCompleted event, but I'm finding that I have to sleep the thread for a little while, or else the data doesn't show up where I expect it in the DOM.

Looking at the JavaScript on the page, I can see that it dynamically alters the existing DOM content (setting someDomElement.innerHTML) after the page finishes loading. It's not making any AJAX calls; it's using data it already has from the original page load. (I could try to parse that data, but it is embedded in JavaScript and it's a bit obfuscated.) So evidently I'm somehow getting the DocumentCompleted event before the JavaScript has finished running.

There could eventually be a lot of pages to scrape, so waiting around for half a second or so on every page is far from ideal. I would like to wait only until all the JavaScript that starts on document ready / page load has finished running before I examine the page. Does anyone know of a way to do that?

I suppose the DocumentCompleted event shouldn't fire until then, right? But it definitely appears to fire earlier. Maybe somewhere the page's JavaScript is using a setTimeout. Is there a way to tell whether any timeouts are pending?

Thanks for any help!

Answer 1

You could do one of the following:

Assuming the format of the data never changes, look at how the JavaScript processes the data and do the same on your end, so you can retrieve the data immediately at page load.
Inject JavaScript into the web page and detect DOM modifications to know when to fetch the data from C#.
Write a pure JavaScript solution with PhantomJS.
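
The second option (injecting JavaScript and having it tell C# when the DOM has changed) could be sketched roughly as follows. This is only an illustration under several assumptions: the names `ScriptBridge`, `DomReadyInjector`, `NotifyReady` and `BuildWatcherScript` are hypothetical, the element id is assumed to be known in advance and to contain no quote characters, and "ready" is assumed to mean "the element's innerHTML is no longer empty". The injected script polls with `setInterval` rather than using `MutationObserver`, because the WebBrowser control often runs in an older Internet Explorer document mode where `MutationObserver` may not be available.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Windows.Forms;

// Hypothetical bridge object; page script reaches it as window.external.
[ComVisible(true)]
public class ScriptBridge
{
    private readonly Action<string> onReady;

    public ScriptBridge(Action<string> onReady) { this.onReady = onReady; }

    // Called from the injected JavaScript once the watched element has content.
    public void NotifyReady(string elementId) { onReady(elementId); }
}

public static class DomReadyInjector
{
    // Builds a script that checks the element every 50 ms and reports back
    // once its innerHTML is non-empty (an assumed definition of "ready").
    public static string BuildWatcherScript(string elementId)
    {
        return
            "(function () {" +
            "  var timer = setInterval(function () {" +
            "    var el = document.getElementById('" + elementId + "');" +
            "    if (el && el.innerHTML !== '') {" +
            "      clearInterval(timer);" +
            "      window.external.NotifyReady('" + elementId + "');" +
            "    }" +
            "  }, 50);" +
            "})();";
    }

    // Wires the bridge to the browser and injects the watcher after each top-level load.
    public static void Attach(WebBrowser browser, string elementId, Action<HtmlElement> onReady)
    {
        browser.ObjectForScripting = new ScriptBridge(
            id => onReady(browser.Document.GetElementById(id)));

        browser.DocumentCompleted += (sender, e) =>
        {
            // DocumentCompleted also fires for frames; only act on the top-level document.
            if (e.Url != browser.Url) return;

            HtmlElement script = browser.Document.CreateElement("script");
            script.SetAttribute("text", BuildWatcherScript(elementId));
            browser.Document.GetElementsByTagName("head")[0].AppendChild(script);
        };
    }
}
```

Because `window.external` calls arrive on the thread that owns the control, the `onReady` callback in this sketch runs on the UI thread and can touch the returned `HtmlElement` directly.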

Answer 2

For posterity / anyone else looking at this, what I ended up doing was creating a function that waits, up to a specified timeout, for some particular thing (an element matching a given set of criteria) to show up on the page, and then returns its HtmlElement. It periodically checks the browser DOM, looking for that element. It is intended to be called by a scraper worker running in a background thread; it uses an Invoke to access the browser DOM each time it checks.

    /// <summary>
    /// Waits for a tag that matches the given criteria to show up on the page.
    /// 
    /// Note: This function returns a browser DOM element from the foreground thread, and this scraper is running in a background thread,
    /// so use an invoke [ scraperForm.Browser.Invoke(new Action(()=>{ ... })); ] when doing anything with the returned DOM element.
    /// </summary>
    /// <param name="tagName">The type of tag, or "" if all tags are to be searched.</param>
    /// <param name="id">The id of the tag, or "" if the search is not to be by id.</param>
    /// <param name="className">The class name of the tag, or "" if the search is not to be by class name.</param>
    /// <param name="keyContent">A string to search the tag's innerText for.</param>
    /// <param name="timeout">The maximum time to wait, in milliseconds.</param>
    /// <returns>The first tag to match the criteria, or null if such a tag was not found after the timeout period.</returns>
    public HtmlElement WaitForTag(string tagName, string id, string className, string keyContent, int timeout) {
        Log(string.Format("WaitForTag('{0}','{1}','{2}','{3}',{4}) --", tagName, id, className, keyContent, timeout));
        HtmlElement result = null;
        int timeleft = timeout;
        while (timeleft > 0) {
            //Log("time left: " + timeleft);
            // Access the DOM in the foreground thread using an Invoke call.
            // (required by the WebBrowser control, otherwise cryptic errors result, like "invalid cast")
            scraperForm.Browser.Invoke(new Action(() => {
                HtmlDocument doc = scraperForm.CurrentDocument;
                if (id == "") {
                    //Log("no id supplied..");
                    // no id was supplied, so get tags by tag name if a tag name was supplied, or get all the tags
                    HtmlElementCollection elements = (tagName == "") ? doc.All : doc.GetElementsByTagName(tagName);
                    //Log("found " + elements.Count + " '" + tagName + "' tags");
                    // find the tag that matches the class name (if given) and contains the given content (if any)
                    foreach (HtmlElement element in elements) {
                        if (element == null) continue;
                        if (className != "" && !TagHasClass(element, className)) {
                            //Log(string.Format("looking for className {0}, found {1}", className, element.GetAttribute("className")));
                            continue;
                        }
                        if (keyContent == "" || 
                            (element.InnerText != null && element.InnerText.Contains(keyContent)) ||
                            (tagName == "input" && element.GetAttribute("value").Contains(keyContent)) ||
                            (tagName == "img" && element.GetAttribute("src").Contains(keyContent)) || 
                            (element.OuterHtml != null && element.OuterHtml.Contains(keyContent)))
                        {
                            result = element;
                            break;   // stop at the first match, as the <returns> doc promises
                        }
                    }
                }
                else {
                    //Log(string.Format("searching for tag by id '{0}'", id));
                    // an id was supplied, so get the tag by id 
                    // Log("looking for element with id [" + id + "]");
                    HtmlElement element = doc.GetElementById(id);
                    // make sure it matches any given class name and contains any given content
                    if (
                        element != null 
                        && 
                        (className == "" || TagHasClass(element, className))
                        && 
                        (keyContent == "" || 
                            (element.InnerText != null && element.InnerText.Contains(keyContent))
                        )
                    ) {
                        // Log("  found it");
                        result = element;
                    }
                    else {
                        // Log("  didn't find it");
                    }
                }
            }));
            if (result != null) break;   // the searched for tag appeared, break out of the loop 
            Thread.Sleep(200);           // wait 200 ms before polling again
            // Note: Make sure sleeps like this are outside of invokes to the foreground thread, so they only pause this background thread.
            timeleft -= 200;
        }
        return result;
    }
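
The polling pattern above can also be factored into a small reusable helper. The following is a minimal sketch, not the answer's actual code: the `Poll` class and its `Until` method are hypothetical names. It measures real elapsed time with a `Stopwatch`, so time spent inside the probe (for example, inside an Invoke) counts against the timeout, whereas the loop above subtracts only the sleep interval.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Poll
{
    /// <summary>
    /// Calls probe repeatedly until it returns a non-null value or the timeout elapses.
    /// Returns the value found, or null on timeout. The probe always runs at least once.
    /// </summary>
    public static T Until<T>(Func<T> probe, TimeSpan timeout, TimeSpan interval) where T : class
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (true)
        {
            T result = probe();
            if (result != null) return result;          // found it
            if (clock.Elapsed >= timeout) return null;  // gave up
            Thread.Sleep(interval);                     // pauses only the calling (background) thread
        }
    }
}

// Possible usage from a background scraper thread, assuming the scraperForm
// object from the answer above and a hypothetical element id "price":
//
//   HtmlElement price = Poll.Until(
//       () => (HtmlElement)scraperForm.Browser.Invoke(new Func<HtmlElement>(
//           () => scraperForm.CurrentDocument.GetElementById("price"))),
//       TimeSpan.FromSeconds(5),
//       TimeSpan.FromMilliseconds(200));
```

As with WaitForTag, anything done later with the returned element still needs to go through an Invoke on the browser control.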

Solution 1
2 2013-12-19 23:03:53

Solution 2
2 2017-04-03 19:13:02