Get Status Code Using Parallel
My goal is to get the HTTP status code for about 5,000 URLs.
Constraints:
1/ If URL A redirects to URL B, get the status code of URL B.
2/ If a request times out, retry up to 3 times.
This is what I implemented:
Parallel.ForEach(
    linkList,
    new ParallelOptions() { MaxDegreeOfParallelism = 64 },
    link =>
    {
        HtmlAnalyzor htmlAnalyzor = new HtmlAnalyzor(link.URL);
        int statusCode = -1;
        for (int retryTime = 2; retryTime >= 0; retryTime--)
        {
            statusCode = htmlAnalyzor.GetDestinationURLStatusCode(link.URL, link.IdQualityPage, retryTime);
            if (statusCode != -1 && statusCode != 0) { break; }
        }
        linkStatusCodeDic.Add(link, statusCode);
    });

public int GetDestinationURLStatusCode(string originalURL, int qPageId, int retryTime)
{
    try
    {
        Console.WriteLine("URL:{0}", originalURL);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(originalURL);
        request.Method = "HEAD";
        request.Timeout = 10000;
        // Half of the time, the line below throws a WebException and gives me statusCode = 0.
        _Response = (HttpWebResponse)request.GetResponse();
        string destURL = _Response.ResponseUri.ToString();
        if (originalURL != destURL)
        {
            GetDestinationURLStatusCode(destURL, qPageId, retryTime);
        }
        int statusCode = (int)_Response.StatusCode;
        _Response.Close();
        Console.WriteLine("Normal:{0}", statusCode);
        return statusCode;
    }
    catch (WebException webEx)
    {
        int statusCode = 0;
        if (webEx.Status == WebExceptionStatus.ProtocolError)
        {
            //statusCode = (int)((HttpWebResponse)webEx.Response).StatusCode;
            Console.WriteLine("WebEx:{0}", statusCode);
        }
        if (_Response != null)
        {
            _Response.Close();
        }
        return statusCode;
    }
    catch (Exception ex)
    {
        if (_Response != null)
        {
            _Response.Close();
        }
        if (retryTime == 0)
        {
            Console.WriteLine("Failed to get status code for URL['{1}'] on the Page[Code:{2}].{0}ErrorMessage:{3}", Environment.NewLine, _URL, pageId, ex.Message);
        }
        return -1;
    }
}

Result of my code: about half of the time, it throws a WebException and gives me a status code of 0.
What I've tried to change this situation:
1/ I changed MaxDegreeOfParallelism to 40 and then to 20, but that did not help.
2/ I changed request.Timeout to 20 s, 30 s, and even 90 s, but that did not help either.

I've changed my code, and now it's working. The main points that I've changed are:
1/ Deleted: new ParallelOptions() { MaxDegreeOfParallelism = 64 }
2/ Run the links in parallel first, then use a traditional for loop to deal with the ones that failed in the parallel pass. This increases the success rate.
3/ Modified some parameters of the HttpWebRequest:
request.UserAgent = "html-analyzor";
request.KeepAlive = false;
request.Timeout = 15000;
Here's the code:
List<QualityPageLink> linkListToRetrySync = new List<QualityPageLink>();
// Raise the limit on concurrent connections so parallel requests are not queued.
ServicePointManager.DefaultConnectionLimit = 1000;
Parallel.ForEach(
    linkList,
    //new ParallelOptions() { MaxDegreeOfParallelism = 64 },
    link =>
    {
        HtmlAnalyzor htmlAnalyzor = new HtmlAnalyzor(link.URL);
        int statusCode = -1;
        for (int retryTime = 2; retryTime >= 0; retryTime--)
        {
            statusCode = htmlAnalyzor.GetDestinationURLStatusCode(link.URL, link.IdQualityPage, retryTime);
            if (statusCode > 0) { break; }
        }
        // List<T> and Dictionary<TKey, TValue> are not thread-safe, so guard the shared writes.
        lock (linkListToRetrySync)
        {
            if (statusCode != 200) { linkListToRetrySync.Add(link); }
            linkIdStatusCodeDic.Add(link, statusCode);
        }
    });

// Second pass: retry sequentially the links that did not return 200 in the parallel pass.
if (linkListToRetrySync != null && linkListToRetrySync.Count != 0)
{
    for (int i = 0; i < linkListToRetrySync.Count; i++)
    {
        var link = linkListToRetrySync[i];
        int statusCode = -1;
        HtmlAnalyzor htmlAnalyzor = new HtmlAnalyzor(link.URL);
        for (int retryTime = 2; retryTime >= 0; retryTime--)
        {
            statusCode = htmlAnalyzor.GetDestinationURLStatusCode(link.URL, link.IdQualityPage, retryTime);
            if (statusCode > 0) { break; }
        }
        linkIdStatusCodeDic[link] = statusCode;
    }
}
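
The two-pass idea (parallel first, then a sequential retry of the failures) can also be sketched in isolation with thread-safe collections. This is only a minimal sketch, not the code I actually run: TwoPassChecker, Check, ProbeWithRetry and the probe delegate are hypothetical names, with probe standing in for GetDestinationURLStatusCode. It also assumes that only links with no status code at all (a value <= 0) go to the second pass, whereas the code above retries everything that is not 200.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class TwoPassChecker
{
    // 'probe' is a hypothetical stand-in for GetDestinationURLStatusCode:
    // it returns an HTTP status code, or a value <= 0 when the request failed.
    public static Dictionary<string, int> Check(
        IEnumerable<string> urls, Func<string, int> probe, int maxAttempts = 3)
    {
        var results = new ConcurrentDictionary<string, int>();
        var failed = new ConcurrentQueue<string>();

        // Pass 1: probe everything in parallel. The concurrent collections
        // make the shared writes safe without an explicit lock.
        Parallel.ForEach(urls, url =>
        {
            int code = ProbeWithRetry(url, probe, maxAttempts);
            results[url] = code;
            if (code <= 0) { failed.Enqueue(url); }
        });

        // Pass 2: retry the failures one at a time, without contention.
        foreach (string url in failed)
        {
            results[url] = ProbeWithRetry(url, probe, maxAttempts);
        }
        return new Dictionary<string, int>(results);
    }

    // Calls 'probe' up to maxAttempts times and stops at the first positive code.
    private static int ProbeWithRetry(string url, Func<string, int> probe, int maxAttempts)
    {
        int code = -1;
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            code = probe(url);
            if (code > 0) { break; }
        }
        return code;
    }
}
```

Passing the probe as a delegate keeps the sketch runnable without any network access; in real use it would wrap the HTTP call.
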

public int GetDestinationURLStatusCode(string originalURL, int qPageId, int retryTime)
{
    HttpWebRequest request;
    int statusCode = -1;
    //HttpWebResponse response = null;
    try
    {
        Console.WriteLine("URL:{0}", Helper.ToString(originalURL));
        request = (HttpWebRequest)WebRequest.Create(originalURL);
        request.UserAgent = "html-analyzor";
        request.KeepAlive = false;
        request.Timeout = 15000;
        using (this._Response = (HttpWebResponse)request.GetResponse())
        {
            statusCode = (int)_Response.StatusCode;
        }
        //string destURL = _Response.ResponseUri.ToString();
        //if (originalURL != destURL)
        //{
        //    GetDestinationURLStatusCode(destURL, qPageId, retryTime);
        //}
        Console.WriteLine("Normal:{0}", statusCode);
        return statusCode;
    }
    catch (WebException webEx)
    {
        statusCode = 0;
        if (webEx.Status == WebExceptionStatus.ProtocolError)
        {
            statusCode = (int)((HttpWebResponse)webEx.Response).StatusCode;
            Console.WriteLine("WebEx:{0}", statusCode);
        }
        if (this._Response != null)
        {
            this._Response.Close();
            this._Response = null;
        }
        return statusCode;
    }
    catch (Exception ex)
    {
        if (this._Response != null)
        {
            this._Response.Close();
            this._Response = null;
        }
        if (retryTime == 0)
        {
            // Console.WriteLine("Failed.");
        }
        return -1;
    }
}
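
On the redirect constraint: HttpWebRequest follows redirects by default (AllowAutoRedirect is true), so the status code read from the response already belongs to the final URL, which is why the recursive call could be commented out. Below is a minimal sketch of the status lookup on its own, assuming a HEAD request is acceptable to the servers being checked. StatusProbe, GetStatusCode and StatusFrom are hypothetical names that are not part of my code above.

```csharp
using System;
using System.Net;

public static class StatusProbe
{
    // Returns the final HTTP status code, or 0 when no HTTP response was
    // received (timeout, DNS failure, refused connection, ...).
    // Assumes 'url' is a well-formed http/https URL.
    public static int GetStatusCode(string url, int timeoutMs = 15000)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";           // assumption: some servers reject HEAD; remove to use GET
        request.AllowAutoRedirect = true;  // the default, stated here for clarity
        request.KeepAlive = false;
        request.Timeout = timeoutMs;
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return (int)response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            return StatusFrom(ex);
        }
    }

    // Maps a WebException to a status code: 4xx/5xx replies arrive as
    // ProtocolError with a response attached; anything else yields 0.
    public static int StatusFrom(WebException ex)
    {
        if (ex.Status == WebExceptionStatus.ProtocolError)
        {
            // Dispose the attached response so its connection is released.
            using (var response = ex.Response as HttpWebResponse)
            {
                if (response != null) { return (int)response.StatusCode; }
            }
        }
        return 0;
    }
}
```

A result of 0 then means "no HTTP answer at all", which is the case worth retrying; a 404 or 500 is a real answer from the server and retrying it is unlikely to change anything.
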