
HttpWebResponse encoding

I have an encoding problem when trying to get the HTML from google.com. Please give me some advice on how to resolve it. Thanks a lot.

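// Requires: using System; using System.IO; using System.Net; using System.Text;
public string Html
{
    get
    {
        try
        {
            var request = WebRequest.Create(Url) as HttpWebRequest;
            // Check for null before touching the request: the cast yields null for non-HTTP URLs.
            if (request == null)
            {
                return string.Format("Could not create object HttpWebRequest for '{0}'", Url);
            }
            request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.83 Safari/537.1,gzip(gfe)";

            // Dispose the response, stream and reader so the connection is released.
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var stream = response.GetResponseStream())
            {
                // CharacterSet may be null or empty; fall back to UTF-8 instead of throwing.
                string charset = response.CharacterSet;
                Encoding encoding = string.IsNullOrEmpty(charset)
                    ? Encoding.UTF8
                    : Encoding.GetEncoding(charset);
                using (var sr = new StreamReader(stream, encoding))
                {
                    return sr.ReadToEnd();
                }
            }
        }
        catch (Exception e)
        {
            return e.Message;
        }
    }
}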

Here is an image as well:

[Image: enter image description here]

The problem is that, for some reason, Google does not send any encoding information in the response headers for this page. If you inspect the headers using the links below (specifically the Content-Type header) and compare the first one (the URL from your image) with the second, you will see that the first is missing the charset information.

http://web-sniffer.net/?url=http://www.google.com.ua/intl/ils/ads/

http://web-sniffer.net/?url=http://www.google.de/

What you need to do is first parse the returned HTML and look for a <meta> element that specifies the encoding, then decode the response bytes again using that encoding. Depending on what you do with the HTML afterwards, you might want to look into http://htmlagilitypack.codeplex.com/, a great library for working with HTML, or just write a regular expression to extract the encoding (though I would really recommend the first option).
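
For the regular-expression route, a minimal sketch could look like the one below. It is only an illustration, not a complete encoding sniffer: the class name HtmlCharset, its methods, the 4096-byte scan window and the UTF-8 fallback are all assumptions of mine. As far as I know, on .NET Framework response.CharacterSet falls back to ISO-8859-1 for text/* responses that carry no charset, so the sketch takes the raw Content-Type value instead.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: picks an encoding from the Content-Type header,
// then from a <meta> tag, then falls back to UTF-8 (an assumed default).
public static class HtmlCharset
{
    // charset=... in a raw Content-Type value, e.g. "text/html; charset=utf-8".
    private static readonly Regex HeaderCharset = new Regex(
        @"charset\s*=\s*[""']?([\w\-:.]+)", RegexOptions.IgnoreCase);

    // charset=... inside a <meta> tag; covers <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?([\w\-:.]+)", RegexOptions.IgnoreCase);

    public static Encoding Detect(byte[] body, string contentTypeHeader)
    {
        Encoding fromHeader = FromMatch(HeaderCharset.Match(contentTypeHeader ?? ""));
        if (fromHeader != null) return fromHeader;

        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // at the start of the page stays readable whatever the real encoding is.
        int prefixLength = Math.Min(body.Length, 4096); // assumed scan window
        string prefix = Encoding.GetEncoding("ISO-8859-1").GetString(body, 0, prefixLength);

        return FromMatch(MetaCharset.Match(prefix)) ?? Encoding.UTF8;
    }

    public static string Decode(byte[] body, string contentTypeHeader)
    {
        return Detect(body, contentTypeHeader).GetString(body);
    }

    // Returns null when there is no match or the name is not a known encoding.
    private static Encoding FromMatch(Match match)
    {
        if (!match.Success) return null;
        try { return Encoding.GetEncoding(match.Groups[1].Value); }
        catch (ArgumentException) { return null; }
        catch (NotSupportedException) { return null; }
    }
}
```

To use it with the code from the question, copy response.GetResponseStream() into a MemoryStream, then call HtmlCharset.Decode(memoryStream.ToArray(), response.ContentType) instead of handing the stream to a StreamReader. Buffering the raw bytes first is what makes it possible to decode them a second time once the <meta> charset is known.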
