
Get HtmlDocument after JavaScript manipulations

In C#, using the System.Windows.Forms.HtmlDocument class (or another class that allows DOM parsing), is it possible to wait until a webpage finishes its JavaScript manipulations of the HTML before retrieving that HTML? Certain sites add innerHTML to pages through JavaScript, but those changes do not show up when I parse the HtmlElements of the HtmlDocument.

One possibility would be to update the HtmlDocument of the page after a second. Does anybody know how to do this?

Someone revived this question by posting what I think is an incorrect answer. So, here are my thoughts to address it.

It's possible to get close to finding out whether the page has finished its AJAX stuff, but only non-deterministically. It completely depends on the logic of that particular page: some pages are perpetually dynamic.

To approach this, one can handle the DocumentCompleted event first, then asynchronously poll the WebBrowser.IsBusy property and monitor the current HTML snapshot of the page for changes, as below.

The complete sample can be found here.

// get the root element
var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];

// poll the current HTML for changes asynchronously
// (token is a CancellationToken supplied by the caller)
var html = documentElement.OuterHtml;
while (true)
{
    // wait asynchronously, this will throw if cancellation requested
    await Task.Delay(500, token);

    // continue polling if the WebBrowser is still busy
    if (this.webBrowser.IsBusy)
        continue;

    var htmlNow = documentElement.OuterHtml;
    if (html == htmlNow)
        break; // no changes detected, end the poll loop

    html = htmlNow;
}
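
Since some pages are perpetually dynamic, the loop above may never see two identical snapshots. A minimal sketch of the same polling idea with an upper bound on the number of polls could look like this. The names SnapshotPoller, WaitForStableAsync and maxPolls are made up for illustration and are not part of the linked sample; the snapshot and busy checks are passed in as delegates so the helper does not depend on WebBrowser itself.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SnapshotPoller
{
    // Polls getSnapshot until two consecutive reads match while isBusy()
    // returns false, or until maxPolls polls have been made.
    // Returns the last snapshot seen. (Hypothetical helper, not a library API.)
    public static async Task<string> WaitForStableAsync(
        Func<string> getSnapshot,
        Func<bool> isBusy,
        TimeSpan interval,
        int maxPolls,
        CancellationToken token)
    {
        var previous = getSnapshot();
        for (var i = 0; i < maxPolls; i++)
        {
            // wait asynchronously; throws if cancellation is requested
            await Task.Delay(interval, token);

            // skip this round if the browser still reports activity
            if (isBusy())
                continue;

            var current = getSnapshot();
            if (current == previous)
                return current; // unchanged between two polls: treat as settled

            previous = current;
        }

        // gave up: the page may be perpetually dynamic
        return previous;
    }
}
```

With a WebBrowser it would presumably be called from the UI thread after DocumentCompleted, for example with () => documentElement.OuterHtml as getSnapshot, () => this.webBrowser.IsBusy as isBusy, a 500 ms interval and whatever maxPolls suits the page.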

In general the answer is "no": unless script on the page notifies your code in some way, you simply have to wait some time and then grab the HTML. Waiting a second after the document-ready notification will likely cover most sites (i.e. jQuery's $(code) cases).
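
For the case where script on the page can notify your code, one option with the WinForms WebBrowser is its ObjectForScripting property, which exposes a COM-visible object to page script as window.external. The sketch below only shows the notifier object; the class name PageReadyNotifier and the method NotifyReady are made up for illustration, and it assumes you control the page (or can otherwise get its script to make the call).

```csharp
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Must be COM-visible so that page script can reach it via window.external.
// (Hypothetical class; the names are assumptions for this example.)
[ComVisible(true)]
public class PageReadyNotifier
{
    private readonly TaskCompletionSource<bool> _ready =
        new TaskCompletionSource<bool>();

    // Completes once the page has called NotifyReady.
    public Task Ready => _ready.Task;

    // Intended to be called from page script as: window.external.NotifyReady();
    public void NotifyReady()
    {
        // TrySetResult tolerates the page calling this more than once
        _ready.TrySetResult(true);
    }
}
```

The wiring would be roughly: create a PageReadyNotifier, assign it to webBrowser.ObjectForScripting before navigating, await notifier.Ready, and only then read the document's OuterHtml.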

You need to give the application a second to process the JavaScript. Simply halting the current thread will delay the JavaScript processing as well, so your document will still come up outdated.

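WebBrowserDocumentCompletedEventArgs cachedLoadArgs;

private void TimerDone(object sender, EventArgs e)
{
    // one-shot: stop and release the timer, then process the page
    var timer = (System.Windows.Forms.Timer)sender;
    timer.Stop();
    timer.Dispose();
    respondToPageLoaded(cachedLoadArgs); // your own page-processing method
}

void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    cachedLoadArgs = e;

    // a WinForms timer ticks on the UI thread, so the message loop keeps
    // running (and the page's script keeps executing) during the wait
    var timer = new System.Windows.Forms.Timer();
    timer.Interval = 1000; // milliseconds
    timer.Tick += TimerDone;
    timer.Start();
}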

I did this with WebBrowser; take a look at my class:

public class MYCLASSProduct: IProduct
{
    public string Name { get; set; }
    public double Price { get; set; }
    public string Url { get; set; }

    private WebBrowser _WebBrowser;
    private AutoResetEvent _lock;

    public void Load(string url)
    {
        _lock = new AutoResetEvent(false);
        this.Url = url;

        browserInitializeBecauseJavascriptLoadThePage();
    }

    private void browserInitializeBecauseJavascriptLoadThePage()
    {
        _WebBrowser = new WebBrowser();
        _WebBrowser.DocumentCompleted += webBrowser_DocumentCompleted;
        _WebBrowser.Dock = DockStyle.Fill;
        _WebBrowser.Name = "webBrowser";
        _WebBrowser.ScrollBarsEnabled = false;
        _WebBrowser.TabIndex = 0;
        _WebBrowser.Navigate(Url);

        Form form = new Form();
        form.Hide();
        form.Controls.Add(_WebBrowser);

        Application.Run(form);
        _lock.WaitOne();
    }

    private void webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
        hDocument.LoadHtml(_WebBrowser.Document.Body.OuterHtml);
        this.Price = Convert.ToDouble(hDocument.DocumentNode.SelectNodes("//td[@class='ask']").FirstOrDefault().InnerText.Trim());
        _WebBrowser.FindForm().Close();
        _lock.Set();
    }
}

If you're trying to do this in a console application, you need to put this attribute above your Main method, because Windows needs to communicate with COM components:

[STAThread]
static void Main(string[] args)

I do not like this solution, but I don't think there is a better one!

How about using the "WebBrowser.Navigated" event?
