C# WebClient only downloads partial html

I am working on a scraping app and ran into a problem while trying to get it working. For testing, I have replaced the original scraping target in the code below with Google's web page. It seems that my download doesn't get everything: I notice that the body and html tags are missing their closing tags. How do I get it to download everything? What's wrong with my sample code?

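// Needs: using System.IO; using System.Net; using System.Text; using System.Web;
string filename = "test.html";
string data;

// WebClient is IDisposable, so release it once the download finishes.
using (WebClient client = new WebClient())
{
    string searchTerm = HttpUtility.UrlEncode(textBox2.Text);
    client.QueryString.Add("q", searchTerm);
    client.QueryString.Add("hl", "en");
    data = client.DownloadString("http://www.google.com/search");
}

// The using block flushes and closes the writer, even if Write throws.
using (StreamWriter writer = new StreamWriter(filename, false, Encoding.Unicode))
{
    writer.Write(data);
}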

Google's web pages are now in HTML5, in which the closing tags for BODY and HTML are optional and may be omitted, which is why Google leaves them out (believe it or not, it saves them bandwidth).

See this article.

You can write HTML5 in either "HTML/SGML" mode (which allows the omitting of closing tags like HTML did prior to XHTML) or in "XHTML" which follows the rules of XML, requiring all tags to be closed.

Which the browser chooses to parse the page depends on whether you send a Content-type header of text/html for HTML/SGML syntax or application/xhtml+xml for XHTML syntax. (Source: HTML5 syntax - HTML vs XHTML)
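To see which of the two syntaxes a given response uses, you could inspect the Content-Type header that WebClient exposes after a download. The following is only a minimal sketch: the class and method names (SyntaxCheck, DescribeSyntax, DescribeUrl) are made up for illustration and are not part of the original question or of any library.

```csharp
using System;
using System.Net;

static class SyntaxCheck
{
    // Hypothetical helper: maps a Content-Type header value to the
    // HTML5 syntax a browser would use to parse the response.
    public static string DescribeSyntax(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
            return "unknown";

        // Drop parameters such as "; charset=UTF-8".
        string mediaType = contentType.Split(';')[0].Trim();

        if (mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase))
            return "XHTML syntax (all tags must be closed)";
        if (mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase))
            return "HTML syntax (some closing tags are optional)";
        return "other";
    }

    // Hypothetical helper: downloads a page and describes its syntax.
    public static string DescribeUrl(string url)
    {
        using (var client = new WebClient())
        {
            client.DownloadString(url);
            // ResponseHeaders is populated once the request has completed.
            return DescribeSyntax(client.ResponseHeaders["Content-Type"]);
        }
    }
}
```

Calling SyntaxCheck.DescribeUrl on the search URL from the question should show whether the page is served as text/html, in which case missing closing tags are allowed by the syntax and are not a sign of a truncated download.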

...Google's page doesn't have the closing tags for <body> and <html>. Talk about crazy optimization...

http://www.google.com/search does not have the closing tags.
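If you want to confirm this on your own downloaded string, a rough check along these lines might help. It is a sketch only: ClosingTagCheck and HasEndTag are hypothetical names, and the regular expression is a simple text search, not an HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

static class ClosingTagCheck
{
    // Hypothetical helper: true if the markup contains an end tag such as
    // </body> (case-insensitive, whitespace allowed before the '>').
    public static bool HasEndTag(string html, string tagName)
    {
        string pattern = "</" + Regex.Escape(tagName) + @"\s*>";
        return Regex.IsMatch(html, pattern, RegexOptions.IgnoreCase);
    }
}
```

For the data string in the question, ClosingTagCheck.HasEndTag(data, "body") returning false would match what these answers describe: the end tags were never sent, so the download itself is complete.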
