
HTML Agility Pack Problems with W3C tools

I'm trying to access the HTML result of the W3C mobileOK Checker by passing a URL such as:

http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F

The URL works if you put it in a browser, but I can't seem to access it via the HtmlAgilityPack. The reason is probably that the URL needs to send a number of requests to its server, since it runs an online test, so it's not just a "static" URL. I have accessed other URLs without any problems. Below is my code:

HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument webGet = hw.Load("http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F");
HtmlNodeCollection nodes = webGet.DocumentNode.SelectNodes("//head");

if (nodes != null)
{
    foreach(HtmlNode n in nodes)
    {
        string x = n.InnerHtml;
    }                        
}

Edit: I tried to access it via a StreamReader, and the website returns the following error: "The remote server returned an error: (403) Forbidden." I'm guessing that it's related.

I checked your example and was able to reproduce the described behaviour. It seems to me that w3.org checks whether the requesting program is a browser or something else.
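One way to test that guess is to send the same request twice, once without and once with a browser-like User-Agent, and compare the status codes. The following is a minimal sketch using only `System.Net`; the `StatusProbe` class and `GetStatus` helper are hypothetical names introduced for illustration, and the result depends on how the server behaves at the time you run it.

```csharp
using System;
using System.Net;

static class StatusProbe
{
    // Returns the HTTP status code of a GET request, optionally sending a User-Agent header.
    public static HttpStatusCode GetStatus(string url, string userAgent)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        if (userAgent != null)
        {
            request.UserAgent = userAgent;
        }

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // Error statuses such as 403 surface as a WebException that carries the response.
            var errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse == null)
            {
                throw; // no HTTP response at all (DNS failure, timeout, ...)
            }
            using (errorResponse)
            {
                return errorResponse.StatusCode;
            }
        }
    }

    static void Main()
    {
        string url = "http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F";
        string browserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko";

        Console.WriteLine("Without User-Agent: " + GetStatus(url, null));
        Console.WriteLine("With browser User-Agent: " + GetStatus(url, browserAgent));
    }
}
```

If the two status codes differ, the server is most likely filtering on request headers rather than on anything specific to HtmlAgilityPack.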

I had created an extended WebClient class for another project of my own, and with it I was able to access the given URL successfully.

Program.cs

WebClientExtended client = new WebClientExtended();
string exportPath = @"e:\temp"; // adapt to your own needs

string url = "http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F";
// Load the HTML with the custom WebClient class,
// but use HtmlAgilityPack for parsing, manipulation, and so on.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(System.Text.Encoding.UTF8.GetString(client.DownloadData(url)));
doc.Save(Path.Combine(exportPath, "check.html"));

WebClientExtended

public class WebClientExtended : WebClient
{
    #region Fields
    private CookieContainer container = new CookieContainer();
    #endregion

    #region Properties
    public CookieContainer CookieContainer
    {
        get { return container; }
        set { container = value; }
    }
    #endregion

    #region Constructors
    public WebClientExtended()
    {
        this.container = new CookieContainer();
    }
    #endregion

    #region Methods
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest r = base.GetWebRequest(address);
        var request = r as HttpWebRequest;
        if (request == null)
        {
            // Not an HTTP(S) request (for example file://), so there is nothing to customize.
            return r;
        }

        // Redirects are followed manually in GetWebResponse.
        request.AllowAutoRedirect = false;
        request.ServicePoint.Expect100Continue = false;
        request.CookieContainer = container;

        // Send browser-like headers; the server appears to reject other clients.
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"; // IE 11

        request.Headers.Set("Accept-Encoding", "gzip, deflate, sdch");
        request.Headers.Set("Accept-Language", "de-AT,de;q=0.8,en;q=0.6,en-US;q=0.4,fr;q=0.2");
        request.Headers.Add(System.Net.HttpRequestHeader.KeepAlive, "1");

        // Let the framework decompress gzip/deflate response bodies transparently.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse response = base.GetWebResponse(request);

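        string location = response.Headers["Location"];
        if (!string.IsNullOrEmpty(location))
        {
            // Location may be relative, so resolve it against the URI of the current request.
            Uri target = new Uri(request.RequestUri, location);
            response.Close(); // release the connection held by the redirect response
            request = GetWebRequest(target);
            request.ContentLength = 0;
            response = GetWebResponse(request);
        }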

        return response;
    }
    #endregion
}

I think the crucial point is adding or adjusting the User-Agent, Accept-Encoding, and Accept-Language headers. The result of my code is the downloaded page check.html.
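If the headers really are the deciding factor, it may also be possible to stay with `HtmlWeb` instead of a custom WebClient. The sketch below assumes the HtmlAgilityPack build in use exposes the `HtmlWeb.UserAgent` property and the `HtmlWeb.PreRequest` hook (a delegate that receives the `HttpWebRequest` before it is sent). It has not been verified against the W3C checker, the header values are only examples, and the cookie and redirect handling of the class above is not reproduced.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string url = "http://validator.w3.org/mobile/check?async=false&docAddr=http%3A%2F%2Fwww.google.com/%2Ftv%2F";

        var web = new HtmlWeb();

        // Present the request like a browser (same User-Agent string as in WebClientExtended).
        web.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko";

        // PreRequest runs just before the request is sent; returning true lets it proceed.
        web.PreRequest = request =>
        {
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
            request.Headers.Set("Accept-Language", "en-US,en;q=0.8"); // example value
            request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
            return true;
        };

        HtmlDocument doc = web.Load(url);
        HtmlNode head = doc.DocumentNode.SelectSingleNode("//head");
        Console.WriteLine(head != null ? head.InnerHtml : "no <head> element found");
    }
}
```

Should the server still answer with 403, the remaining differences from the WebClientExtended approach (cookie container, manual redirect handling) are the next things to compare.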

