C#: download HTML string after page loading is finished
I am trying to use a loop to download a bunch of HTML pages and scrape data from them. But those pages run some JavaScript while loading, so I think WebClient may not be a good choice. But if I use WebBrowser as shown below, it returns an empty HTML string after the first call in the loop.
WebBrowser wb = new WebBrowser();
wb.ScrollBarsEnabled = false;
wb.ScriptErrorsSuppressed = true;
wb.Navigate(url);
while (wb.ReadyState != WebBrowserReadyState.Complete) { Application.DoEvents(); Thread.Sleep(1000); }
html = wb.Document.DomDocument.ToString();
You are correct that WebClient and all of the other HTTP client interfaces will completely ignore JavaScript; none of them are browsers, after all.
You want:
var html = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;
Note that if you load via a WebBrowser you don't need to scrape the raw markup; you can use DOM methods like GetElementById, GetElementsByTagName and so on.
The while loop is very VBScript; there is a DocumentCompleted event you should wire your code into instead.
private void Whatever()
{
    WebBrowser wb = new WebBrowser();
    wb.DocumentCompleted += Wb_DocumentCompleted;
    wb.ScriptErrorsSuppressed = true;
    wb.Navigate("http://stackoverflow.com");
}

private void Wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var wb = (WebBrowser)sender;
    var html = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;
    var domd = wb.Document.GetElementById("copyright").InnerText;
    /* ... */
}
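Since the question is about a loop over many pages, here is a minimal sketch of how the event-driven approach could replace the loop: each DocumentCompleted handler call collects the HTML and then starts the next navigation. The class name SequentialScraper and the onPage callback are hypothetical names made up for illustration. The sketch assumes it runs on a WinForms UI (STA) thread with a message loop. It also assumes the page's scripts have finished by the time the document reports Complete, which may not hold for pages that keep loading data afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper: visits URLs one at a time and hands back the rendered HTML.
public sealed class SequentialScraper
{
    private readonly WebBrowser _browser = new WebBrowser { ScriptErrorsSuppressed = true };
    private readonly Queue<string> _pending;
    private readonly Action<string, string> _onPage; // receives (url, html)

    public SequentialScraper(IEnumerable<string> urls, Action<string, string> onPage)
    {
        _pending = new Queue<string>(urls);
        _onPage = onPage;
        _browser.DocumentCompleted += OnDocumentCompleted;
    }

    // Assumption: called on an STA thread with a running message loop (e.g. from a Form).
    public void Start()
    {
        NavigateNext();
    }

    private void NavigateNext()
    {
        if (_pending.Count > 0)
            _browser.Navigate(_pending.Dequeue());
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event can fire once per frame; skip calls that are not for the
        // top-level document or that arrive before it is fully loaded.
        if (_browser.ReadyState != WebBrowserReadyState.Complete || !e.Url.Equals(_browser.Url))
            return;

        var root = _browser.Document.GetElementsByTagName("HTML");
        var html = root.Count > 0 ? root[0].OuterHtml : string.Empty;
        _onPage(e.Url.ToString(), html);

        // Only now move on, so each page is read before the next one replaces it.
        NavigateNext();
    }
}
```

From a Form, this might be used as `_scraper = new SequentialScraper(urls, (url, html) => { /* parse here */ }); _scraper.Start();`, keeping `_scraper` in a field so that it stays alive while pages load. Reusing a single WebBrowser and advancing only from the handler avoids the blocking DoEvents/Sleep loop entirely.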