
HttpWebResponse encoding

I have a problem with encoding when trying to get the HTML from google.com. Please give me some advice on how to resolve it. Thanks a lot.

public string Html
{
    get
    {
        try
        {
            var request = WebRequest.Create(Url) as HttpWebRequest;
            if (request != null)
            {
                // Set the header only after the cast is known to have succeeded.
                request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.83 Safari/537.1,gzip(gfe)";
                using (var response = request.GetResponse() as HttpWebResponse)
                {
                    if (response != null)
                    {
                        // CharacterSet is derived from the Content-Type response header.
                        string charset = response.CharacterSet;
                        Encoding encoding = Encoding.GetEncoding(charset);
                        using (var sr = new StreamReader(response.GetResponseStream(), encoding))
                        {
                            return sr.ReadToEnd();
                        }
                    }
                }
            }
            return string.Format("Could not create object HttpWebRequest for '{0}'", Url);
        }
        catch (Exception e)
        {
            return e.Message;
        }
    }
}

Here is an image as well:

[image]

The problem you are facing is that, for some reason, Google doesn't send any encoding information in the response headers. If you inspect the headers using the links below (specifically the Content-Type header) and compare the first one (the URL from your image) with the second, you will see that the first is missing the encoding information.

http://web-sniffer.net/?url=http://www.google.com.ua/intl/ils/ads/

http://web-sniffer.net/?url=http://www.google.de/

What you need to do here is first parse the returned HTML and look for a <meta> element that specifies the encoding, then decode the response again using that encoding. Depending on what you do with the HTML afterwards, you might want to look into http://htmlagilitypack.codeplex.com/, a great library for working with HTML, or just write a regular expression to extract the encoding (though I would really recommend the first alternative).
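For illustration, here is a minimal sketch of the regular-expression alternative (not the Html Agility Pack route). It reads the response body as raw bytes, uses the charset from the Content-Type header when there is one, and otherwise looks for a <meta> charset declaration before decoding. The class and method names (HtmlCharset, FromContentType, FromMeta, Decode, Fetch) are hypothetical, and both the UTF-8 fallback and the 4096-byte probe window are my own assumptions rather than anything from the question.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original question.
public static class HtmlCharset
{
    // Matches the charset parameter of a Content-Type value.
    private static readonly Regex HeaderCharset = new Regex(
        @"charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase);

    // Matches <meta charset="x"> and <meta http-equiv=... content="...; charset=x">.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase);

    // Returns the charset named in the header value, or null if there is none.
    public static string FromContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return null;
        Match m = HeaderCharset.Match(contentType);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns the charset declared in a <meta> element, or null if none is found.
    public static string FromMeta(byte[] body)
    {
        // The markup itself is ASCII, so an ASCII view of the first bytes is
        // enough to find the declaration (assumption: it sits in the first 4096 bytes).
        int probeLength = Math.Min(body.Length, 4096);
        string probe = Encoding.ASCII.GetString(body, 0, probeLength);
        Match m = MetaCharset.Match(probe);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Decodes the body using the header charset, then the <meta> charset,
    // then UTF-8 as an assumed last resort.
    public static string Decode(byte[] body, string contentType)
    {
        string name = FromContentType(contentType) ?? FromMeta(body);
        Encoding encoding = Encoding.UTF8;
        if (name != null)
        {
            try { encoding = Encoding.GetEncoding(name); }
            catch (ArgumentException) { /* unknown name: keep the fallback */ }
        }
        return encoding.GetString(body);
    }

    // Downloads the page as bytes first so that it can be decoded afterwards.
    public static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);
            return Decode(buffer.ToArray(), response.ContentType);
        }
    }
}
```

In the property from the question, the body of the try block could then be reduced to something like return HtmlCharset.Fetch(Url);. The sketch reads the raw ContentType string instead of response.CharacterSet so that a missing charset can be told apart from one the server actually sent.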
