Get Status Code Using Parallel

My goal is to get the status codes for about 5,000 URLs.

Constraints:
1/ If URL A redirects to URL B, get the status code of URL B.
2/ If a request times out, retry up to 3 times.

This is what I implemented:

    Parallel.ForEach(
        linkList,
        new ParallelOptions() { MaxDegreeOfParallelism = 64 },
        link =>
        {
            HtmlAnalyzor htmlAnalyzor = new HtmlAnalyzor(link.URL);
            int statusCode = -1;
            for (int retryTime = 2; retryTime >= 0; retryTime--)
            {
                statusCode = htmlAnalyzor.GetDestinationURLStatusCode(link.URL, link.IdQualityPage, retryTime);
                if (statusCode != -1 && statusCode != 0) { break; }
            }
            linkStatusCodeDic.Add(link, statusCode);
        });



    public int GetDestinationURLStatusCode(string originalURL, int qPageId, int retryTime)
    {
        try
        {
            Console.WriteLine("URL:{0}", originalURL);
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(originalURL);
            request.Method = "HEAD";
            request.Timeout = 10000;

            // Half of the time, the line below will throw a WebException and give me a statusCode=0;
            _Response = (HttpWebResponse)request.GetResponse();

            string destURL = _Response.ResponseUri.ToString();
            if (originalURL != destURL)
            {
                GetDestinationURLStatusCode(destURL, qPageId, retryTime);
            }
            int statusCode = (int)_Response.StatusCode;
            _Response.Close();
            Console.WriteLine("Normal:{0}", statusCode);
            return statusCode;
        }
        catch (WebException webEx)
        {
            int statusCode = 0;
            if (webEx.Status == WebExceptionStatus.ProtocolError)
            {
                //statusCode = (int)((HttpWebResponse)webEx.Response).StatusCode;
                Console.WriteLine("WebEx:{0}", statusCode);
            }
            if (_Response != null)
            {
                _Response.Close();
            }
            return statusCode;
        }
        catch (Exception ex)
        {
            if (_Response != null)
            {
                _Response.Close();
            }
            if (retryTime == 0)
            {
                Console.WriteLine("Failed to get status code for URL['{1}'] on the Page[Code:{2}].{0}ErrorMessage:{3}", Environment.NewLine, _URL, pageId, ex.Message);
            }

            return -1;
        }
    }

Result of my code: half of the time, it throws a WebException and gives me a status code of 0.
What I've tried to change this situation:
1/ I changed MaxDegreeOfParallelism to 40 and 20; it doesn't help.
2/ I changed request.Timeout to 20s, 30s, even 90s; it doesn't help.

I've changed my code, and now it's working. The main points that I've changed are:

  1. Deleted new ParallelOptions() {MaxDegreeOfParallelism=64}.

  2. Use Parallel first, then use a traditional for loop to deal with the ones that failed in parallel. This increases the success rate.

  3. Some parameters are modified for HttpWebRequest:

    request.UserAgent = "html-analyzor";
    request.KeepAlive = false;
    request.Timeout = 15000;

Here's the code:

    List<QualityPageLink> linkListToRetrySync = new List<QualityPageLink>();
    // Guards the two shared collections below, assuming they are a plain
    // List/Dictionary, which are not safe for concurrent writes.
    object syncRoot = new object();
    ServicePointManager.DefaultConnectionLimit = 1000;
    Parallel.ForEach(
        linkList,
        // new ParallelOptions() { MaxDegreeOfParallelism = 64 },
        link =>
        {
            HtmlAnalyzor htmlAnalyzor = new HtmlAnalyzor(link.URL);
            int statusCode = -1;
            // Up to 3 attempts; stop at the first real HTTP status code.
            for (int retryTime = 2; retryTime >= 0; retryTime--)
            {
                statusCode = htmlAnalyzor.GetDestinationURLStatusCode(link.URL, link.IdQualityPage, retryTime);
                if (statusCode > 0) { break; }
            }
            lock (syncRoot)
            {
                if (statusCode != 200) { linkListToRetrySync.Add(link); }
                linkIdStatusCodeDic.Add(link, statusCode);
            }
        });

    // Second pass: retry sequentially the links that did not return 200 in the parallel pass.
    for (int i = 0; i < linkListToRetrySync.Count; i++)
    {
        var link = linkListToRetrySync[i];
        int statusCode = -1;
        HtmlAnalyzor htmlAnalyzor = new HtmlAnalyzor(link.URL);
        for (int retryTime = 2; retryTime >= 0; retryTime--)
        {
            statusCode = htmlAnalyzor.GetDestinationURLStatusCode(link.URL, link.IdQualityPage, retryTime);
            if (statusCode > 0) { break; }
        }
        linkIdStatusCodeDic[link] = statusCode;
    }
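The "parallel first, then sequential retry" idea can also be written with thread-safe collections instead of a shared List and Dictionary. The following is only a minimal sketch under assumptions: TwoPhaseChecker, CheckAll, ProbeWithRetry and the probe delegate are hypothetical names standing in for HtmlAnalyzor.GetDestinationURLStatusCode, and URLs are plain strings rather than QualityPageLink objects.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class TwoPhaseChecker
{
    // Calls 'probe' up to maxAttempts times and returns the first result > 0
    // (a real HTTP status code); otherwise returns the last failure value.
    private static int ProbeWithRetry(Func<string, int> probe, string url, int maxAttempts)
    {
        int status = -1;
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            status = probe(url);
            if (status > 0) { break; }
        }
        return status;
    }

    // 'probe' is a hypothetical stand-in for GetDestinationURLStatusCode:
    // it returns an HTTP status code, or a value <= 0 on failure.
    public static IDictionary<string, int> CheckAll(
        IEnumerable<string> urls, Func<string, int> probe, int maxAttempts = 3)
    {
        var results = new ConcurrentDictionary<string, int>();
        var retryQueue = new ConcurrentBag<string>();

        // Phase 1: probe in parallel. Both collections accept concurrent writes.
        Parallel.ForEach(urls, url =>
        {
            int status = ProbeWithRetry(probe, url, maxAttempts);
            results[url] = status;
            if (status != 200) { retryQueue.Add(url); }
        });

        // Phase 2: retry every non-200 URL one at a time.
        foreach (string url in retryQueue)
        {
            results[url] = ProbeWithRetry(probe, url, maxAttempts);
        }

        return results;
    }
}
```

A caller would pass something like `url => analyzer.GetDestinationURLStatusCode(url, pageId, 0)` as the probe. Passing the probe as a delegate also makes the retry logic testable without touching the network.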

    public int GetDestinationURLStatusCode(string originalURL, int qPageId, int retryTime)
    {
        HttpWebRequest request;
        int statusCode = -1;
        //HttpWebResponse response = null;
        try
        {
            Console.WriteLine("URL:{0}", Helper.ToString(originalURL));
            request = (HttpWebRequest)WebRequest.Create(originalURL);
            request.UserAgent = "html-analyzor";
            request.KeepAlive = false;
            request.Timeout = 15000;

            using (this._Response = (HttpWebResponse)request.GetResponse())
            {
                statusCode = (int)_Response.StatusCode;
            }

            //string destURL = _Response.ResponseUri.ToString();
            //if (originalURL != destURL)
            //{
            //    GetDestinationURLStatusCode(destURL, qPageId, retryTime);
            //}

            Console.WriteLine("Normal:{0}", statusCode);
            return statusCode;
        }
        catch (WebException webEx)
        {
            statusCode = 0;
            if (webEx.Status == WebExceptionStatus.ProtocolError)
            {
                statusCode = (int)((HttpWebResponse)webEx.Response).StatusCode;
                Console.WriteLine("WebEx:{0}", statusCode);
            }
            if (this._Response != null)
            {
                this._Response.Close();
                this._Response = null;
            }
            return statusCode;
        }
        catch (Exception ex)
        {
            if (this._Response != null)
            {
                this._Response.Close();
                this._Response = null;
            }
            if (retryTime == 0)
            {
                // Console.WriteLine("Failed.");
            }

            return -1;
        }
    }
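One difference from the first version is that the WebException handler now reads the real status code instead of always returning 0. HttpWebRequest reports 4xx and 5xx responses by throwing a WebException with Status set to ProtocolError, so a 404 shows up there, not in the normal path. That mapping can be isolated as below. This is a sketch only: StatusCodeMapper and FromWebException are hypothetical names, and the null check on the response is an added precaution that the code above does not have.

```csharp
using System.Net;

public static class StatusCodeMapper
{
    // Returns the HTTP status code carried by a ProtocolError exception,
    // or 0 when no HTTP response was received (timeout, DNS failure, ...).
    public static int FromWebException(WebException ex)
    {
        if (ex.Status == WebExceptionStatus.ProtocolError)
        {
            var response = ex.Response as HttpWebResponse;
            if (response != null)
            {
                return (int)response.StatusCode;
            }
        }
        return 0;
    }
}
```

With this helper, the catch block reduces to `return StatusCodeMapper.FromWebException(webEx);` after closing the response, and a result of 0 means no HTTP answer arrived, so the request is worth retrying.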
