Using HtmlWeb causes HttpWebRequest to timeout
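I've run into a situation where I'm using the HtmlAgilityPack to load web pages and scrape the document content. I have a number of URLs to load, and some of them require gzip encoding, so I catch the exception thrown by HtmlWeb.Load(), check whether it is a gzip encoding problem, and then fall back to an HttpWebRequest to load the page. The first pass through the HttpWebRequest path succeeds, but every later attempt with an HttpWebRequest times out.
Here is a cleaned-up version of the code: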
HtmlDocument doc = new HtmlDocument();
HtmlWeb web = new HtmlWeb();
try
{
    doc = web.Load(uri);
    Console.WriteLine("htmlweb and htmldocument success");
}
catch (ArgumentException ae)
{
    Console.WriteLine("htmlweb and htmldocument not successful");
    if (ae.Message.Contains("\'gzip\'"))
    {
        HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(uri);
        try
        {
            req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
            req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
            req.Method = "GET";
            //req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
            string source;
            req.KeepAlive = false;
            //req.Timeout = 100000;
            // On the second iteration we never get beyond this line
            using (WebResponse webResponse = req.GetResponse())
            {
                using (HttpWebResponse httpWebResponse = webResponse as HttpWebResponse)
                {
                    using (StreamReader reader = new StreamReader(httpWebResponse.GetResponseStream()))
                    {
                        source = reader.ReadToEnd();
                    }
                }
            }
            req.Abort();
            Console.WriteLine("httpwebresponse successfull");
        }
        catch (WebException we)
        {
            Console.WriteLine("httpwebresponse not successful");
        }
    }
}
Is there some cleanup I need to do, or have I forgotten something?
Any help would be greatly appreciated.
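I figured out that I have to load the page through WebRequest first, rather than through HtmlWeb, then check the response headers for gzip and decompress as needed each time.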
// Requires: using System; using System.IO; using System.IO.Compression; using System.Net;
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
//req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
//req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
//req.Method = "GET";
string source = String.Empty;
try
{
    // The using blocks dispose the response and its stream once the body is read.
    using (HttpWebResponse httpWebResponse = (HttpWebResponse)req.GetResponse())
    using (Stream body = httpWebResponse.GetResponseStream())
    {
        // Treat a missing Content-Encoding value as "no compression".
        string encoding = (httpWebResponse.ContentEncoding ?? String.Empty).ToLowerInvariant();
        Stream decoded;
        if (encoding.Contains("gzip"))
        {
            decoded = new GZipStream(body, CompressionMode.Decompress);
        }
        else if (encoding.Contains("deflate"))
        {
            decoded = new DeflateStream(body, CompressionMode.Decompress);
        }
        else
        {
            decoded = body;
        }
        using (StreamReader reader = new StreamReader(decoded))
        {
            source = reader.ReadToEnd();
        }
    }
    req.Abort();
}
catch (Exception ex)
{
    //received a 404 Error - apparently one of my links is now dead...
    // Note: this catch swallows every exception, not only 404 responses.
}
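
The header check above can also be pulled out into a small helper so it can be exercised without a network call. This is only a sketch: `ResponseDecoder`, `WrapForEncoding` and `ReadBody` are hypothetical names introduced here for illustration, not part of HtmlAgilityPack or the .NET framework, and it assumes the same "gzip" / "deflate" substring test used in the code above.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class ResponseDecoder
{
    // Hypothetical helper: wraps the raw body stream in the decompressor
    // named by the Content-Encoding value (which may be null or empty).
    public static Stream WrapForEncoding(Stream body, string contentEncoding)
    {
        string encoding = (contentEncoding ?? String.Empty).ToLowerInvariant();
        if (encoding.Contains("gzip"))
        {
            return new GZipStream(body, CompressionMode.Decompress);
        }
        if (encoding.Contains("deflate"))
        {
            return new DeflateStream(body, CompressionMode.Decompress);
        }
        return body; // no known compression: read the stream as-is
    }

    // Reads the whole body as text and disposes both the wrapper and the
    // underlying stream when done.
    public static string ReadBody(Stream body, string contentEncoding)
    {
        using (Stream decoded = WrapForEncoding(body, contentEncoding))
        using (StreamReader reader = new StreamReader(decoded))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With a live response this would presumably be called as ResponseDecoder.ReadBody(httpWebResponse.GetResponseStream(), httpWebResponse.ContentEncoding), and the resulting string can then be handed to HtmlDocument.LoadHtml for parsing.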