Get HtmlDocument after JavaScript manipulations
In C#, using the System.Windows.Forms.HtmlDocument class (or another class that allows DOM parsing), is it possible to wait until a webpage finishes its JavaScript manipulations of the HTML before retrieving that HTML? Certain sites add innerHTML to pages through JavaScript, but those changes do not show up when I parse the HtmlElements of the HtmlDocument.
One possibility would be to update the HtmlDocument of the page after a second. Does anybody know how to do this?
Someone revived this question by posting what I think is an incorrect answer, so here are my thoughts to address it.
There is no deterministic way to know, but it is possible to get close to finding out whether the page has finished its AJAX activity. However, it depends entirely on the logic of that particular page: some pages are perpetually dynamic.
To approach this, one can handle the DocumentCompleted event first, then asynchronously poll the WebBrowser.IsBusy property and monitor the current HTML snapshot of the page for changes, like below.
The complete sample can be found here.
// get the root element
var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];

// poll the current HTML for changes asynchronously
var html = documentElement.OuterHtml;
while (true)
{
    // wait asynchronously, this will throw if cancellation requested
    await Task.Delay(500, token);

    // continue polling if the WebBrowser is still busy
    if (this.webBrowser.IsBusy)
        continue;

    var htmlNow = documentElement.OuterHtml;
    if (html == htmlNow)
        break; // no changes detected, end the poll loop

    html = htmlNow;
}
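Because some pages never stop changing, the loop above can spin forever unless it is cancelled. As a rough sketch only, the same polling idea can be wrapped in a helper that gives up after a fixed number of polls. The class name HtmlStabilityPoller, the method name WaitForStableHtmlAsync, and the maxPolls parameter are hypothetical and not part of WinForms or the original sample; the helper takes delegates so it does not depend on the WebBrowser control directly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class HtmlStabilityPoller
{
    // Hypothetical helper: polls getHtml until two consecutive snapshots
    // match while isBusy() returns false, or until maxPolls polls have been
    // made. Returns the last snapshot taken.
    public static async Task<string> WaitForStableHtmlAsync(
        Func<string> getHtml,
        Func<bool> isBusy,
        TimeSpan pollInterval,
        int maxPolls,
        CancellationToken token = default(CancellationToken))
    {
        var html = getHtml();
        for (var i = 0; i < maxPolls; i++)
        {
            // Throws OperationCanceledException if cancellation is requested.
            await Task.Delay(pollInterval, token);

            // Skip this round if the browser still reports activity.
            if (isBusy())
                continue;

            var htmlNow = getHtml();
            if (html == htmlNow)
                break; // no change since the previous snapshot

            html = htmlNow;
        }
        return html;
    }
}
```

Called from the UI thread (so the continuations resume there), it might be used as `await HtmlStabilityPoller.WaitForStableHtmlAsync(() => documentElement.OuterHtml, () => webBrowser.IsBusy, TimeSpan.FromMilliseconds(500), 20, token);`. The cap of 20 polls is an arbitrary choice for illustration.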
In general, the answer is "no": unless script on the page notifies your code in some way, you simply have to wait some time and then grab the HTML. Waiting a second after the document-ready notification will likely cover most sites (i.e. jQuery's $(code) cases).
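For the case where script on the page can notify the host, one possible sketch uses the WebBrowser.ObjectForScripting property, which exposes a COM-visible object to page script as window.external. This only helps when you control the page's script or can add a call to it. The class name PageReadyBridge and the method name NotifyReady are hypothetical, chosen for illustration.

```csharp
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Hypothetical bridge object; the names are illustrative.
// It must be public and COM-visible to be assigned to
// WebBrowser.ObjectForScripting.
[ComVisible(true)]
public class PageReadyBridge
{
    private readonly TaskCompletionSource<bool> _ready =
        new TaskCompletionSource<bool>();

    // Completes once page script has called NotifyReady().
    public Task Ready { get { return _ready.Task; } }

    // Called from page script as: window.external.NotifyReady();
    public void NotifyReady()
    {
        // TrySetResult makes repeated calls harmless.
        _ready.TrySetResult(true);
    }
}

// Wiring on the UI thread (assumes a WebBrowser field named webBrowser
// and a string variable named url):
//   var bridge = new PageReadyBridge();
//   webBrowser.ObjectForScripting = bridge;
//   webBrowser.Navigate(url);
//   await bridge.Ready;
//   var html = webBrowser.Document.GetElementsByTagName("html")[0].OuterHtml;
```

If the page never calls window.external.NotifyReady(), the task never completes, so in practice it would be sensible to combine the wait with a timeout.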
You need to give the application a second to process the JavaScript. Simply halting the current thread will delay the JavaScript processing as well, so your document will still come up outdated.
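// event args saved from DocumentCompleted for use once the timer fires
WebBrowserDocumentCompletedEventArgs cachedLoadArgs;

private void TimerDone(object sender, EventArgs e)
{
    // one-shot timer: stop and release it, then process the page
    var timer = (Timer)sender;
    timer.Stop();
    timer.Dispose();
    respondToPageLoaded(cachedLoadArgs);
}

void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    cachedLoadArgs = e;

    // a WinForms timer ticks on the UI thread, so the message loop
    // (and the page's scripts) keep running while we wait
    System.Windows.Forms.Timer timer = new System.Windows.Forms.Timer();
    int interval = 1000;
    timer.Interval = interval;
    timer.Tick += new EventHandler(TimerDone);
    timer.Start();
}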
I did this with a WebBrowser control; take a look at my class:
public class MYCLASSProduct: IProduct
{
    public string Name { get; set; }
    public double Price { get; set; }
    public string Url { get; set; }

    private WebBrowser _WebBrowser;
    private AutoResetEvent _lock;

    public void Load(string url)
    {
        _lock = new AutoResetEvent(false);
        this.Url = url;
        browserInitializeBecauseJavascriptLoadThePage();
    }

    private void browserInitializeBecauseJavascriptLoadThePage()
    {
        _WebBrowser = new WebBrowser();
        _WebBrowser.DocumentCompleted += webBrowser_DocumentCompleted;
        _WebBrowser.Dock = DockStyle.Fill;
        _WebBrowser.Name = "webBrowser";
        _WebBrowser.ScrollBarsEnabled = false;
        _WebBrowser.TabIndex = 0;
        _WebBrowser.Navigate(Url);

        Form form = new Form();
        form.Hide();
        form.Controls.Add(_WebBrowser);

        // runs a message loop until the form is closed in DocumentCompleted
        Application.Run(form);
        _lock.WaitOne();
    }

    private void webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // parse the rendered body with HtmlAgilityPack
        HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
        hDocument.LoadHtml(_WebBrowser.Document.Body.OuterHtml);
        this.Price = Convert.ToDouble(hDocument.DocumentNode.SelectNodes("//td[@class='ask']").FirstOrDefault().InnerText.Trim());

        _WebBrowser.FindForm().Close();
        _lock.Set();
    }
}
If you're trying to do this in a console application, you need to put this attribute above your Main method, because Windows needs to communicate with COM components:
[STAThread]
static void Main(string[] args)
I don't like this solution, but I don't think there is a better one!
How about using the WebBrowser.Navigated event?