
Using HtmlWeb causes HttpWebRequest to time out

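I'm using HtmlAgilityPack to load web pages and scrape the document contents. I have many URLs to load, and some of them require gzip encoding, so I catch the exception thrown by HtmlWeb.Load(), check whether it is a gzip encoding problem, and then load the page with HttpWebRequest instead. The first pass through the HttpWebRequest path succeeds, but every later attempt with HttpWebRequest times out.

Here is a cleaned-up version of the code: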

            HtmlDocument doc = new HtmlDocument();
            HtmlWeb web = new HtmlWeb();
            try
            {
                doc = web.Load(uri);

                Console.WriteLine("htmlweb and htmldocument success");
            }
            catch (ArgumentException ae)
            {
                Console.WriteLine("htmlweb and htmldocument not successful");
                if (ae.Message.Contains("\'gzip\'"))
                {
                    HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(uri);
                    try
                    {
                        req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
                        req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
                        req.Method = "GET";
                        //req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
                        string source;
                        req.KeepAlive = false;
                        //req.Timeout = 100000;

                        // On the second iteration we never get beyond this line
                        using (WebResponse webResponse = req.GetResponse())
                        {
                            using (HttpWebResponse httpWebResponse = webResponse as HttpWebResponse)
                            {
                                using (StreamReader reader = new StreamReader(httpWebResponse.GetResponseStream()))
                                {
                                    source = reader.ReadToEnd();
                                }
                            }
                        }

                        req.Abort();
                        Console.WriteLine("httpwebresponse successful");
                    }
                    catch (WebException we)
                    {

                        Console.WriteLine("httpwebresponse not successful");
                    }
                }
            }

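Is there some cleanup I need to do? Or have I forgotten something?

Any help would be greatly appreciated.

I worked out that I had to load the page through WebRequest in the first place rather than through HtmlWeb, then check whether the response headers mention gzip and decompress as needed on every request.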

            // Requires: using System; using System.IO; using System.IO.Compression; using System.Net;
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
            // Left disabled on purpose: decompression is handled by hand below.
            //req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
            //req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
            //req.Method = "GET";
            string source = String.Empty;
            try
            {
                using (HttpWebResponse httpWebResponse = (HttpWebResponse)req.GetResponse())
                {
                    // Choose a decompressing wrapper from the Content-Encoding response header.
                    Stream body = httpWebResponse.GetResponseStream();
                    string contentEncoding = httpWebResponse.ContentEncoding.ToLowerInvariant();
                    if (contentEncoding.Contains("gzip"))
                    {
                        body = new GZipStream(body, CompressionMode.Decompress);
                    }
                    else if (contentEncoding.Contains("deflate"))
                    {
                        body = new DeflateStream(body, CompressionMode.Decompress);
                    }

                    // Disposing the reader also disposes the stream(s) it wraps.
                    using (StreamReader reader = new StreamReader(body))
                    {
                        source = reader.ReadToEnd();
                    }
                }

                req.Abort();
            }
            catch (Exception ex)
            {
                // Received a 404 error: apparently one of my links is now dead...
            }
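
The header check above can also be pulled out into a small helper that is testable without a network connection. The following is only a sketch under stated assumptions: `ResponseDecoder`, `ReadBody` and `Gzip` are hypothetical names that do not appear in the original post, and the body is assumed to be UTF-8 text.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class ResponseDecoder
{
    // Hypothetical helper (not from the original post): wraps the raw body in the
    // decompressor matching the Content-Encoding value and reads it as text.
    public static string ReadBody(Stream raw, string contentEncoding)
    {
        string encoding = (contentEncoding ?? string.Empty).ToLowerInvariant();
        Stream body = raw;
        if (encoding.Contains("gzip"))
        {
            body = new GZipStream(raw, CompressionMode.Decompress);
        }
        else if (encoding.Contains("deflate"))
        {
            body = new DeflateStream(raw, CompressionMode.Decompress);
        }

        // Assumption: the page is UTF-8. Disposing the reader also disposes
        // the wrapped stream(s).
        using (StreamReader reader = new StreamReader(body, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Illustration-only helper: gzip-compresses text in memory so the decoder
    // can be exercised without making a request.
    public static byte[] Gzip(string text)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(text);
                gzip.Write(bytes, 0, bytes.Length);
            }
            // ToArray still works after the stream has been closed.
            return buffer.ToArray();
        }
    }
}
```

With a real response this would presumably be called as `ResponseDecoder.ReadBody(httpWebResponse.GetResponseStream(), httpWebResponse.ContentEncoding)`, and the resulting string could then be handed to HtmlAgilityPack with `doc.LoadHtml(source)` so that HtmlWeb is not involved in the download at all.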

