
screen scraping

I am screen scraping a website that is in Danish. I am unable to scrape certain characters, such as "må". Any idea how to solve this? Thanks.

Try using the UTF-8 or Windows-1252 character set.

It's better to use the same encoding that the HttpWebResponse object reports. The code below should work for any language and character set, as long as the server declares the charset correctly in its Content-Type header.

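        // Requires: using System.IO; using System.Net; using System.Text;
        // 'request' is an HttpWebRequest created earlier.
        string html = null;
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // CharacterSet may be null or empty (e.g. no Content-Type header);
            // falling back to UTF-8 in that case is an assumption.
            string charset = response.CharacterSet;
            Encoding encoding = string.IsNullOrEmpty(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset);

            if (response.StatusCode == HttpStatusCode.OK)
            {
                // Decode the body with the encoding the server reported.
                using (var response_stream = new StreamReader(response.GetResponseStream(), encoding))
                {
                    html = response_stream.ReadToEnd();
                }
            }
        }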

If you are using a WebBrowser control, you can set the page encoding to one that can display that character, then just extract the page source.

I just used System.Web.HttpContext.Current.Server.HtmlDecode() and it worked.
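
HTML decoding helps when the page sends the character as an entity (for example `&aring;`) rather than as a raw byte. Below is a minimal sketch of that case. It uses `System.Net.WebUtility.HtmlDecode` instead of `Server.HtmlDecode` so that it runs outside an ASP.NET request; that substitution and the sample string are assumptions for illustration, not part of the original answer.

```csharp
using System;
using System.Net;
using System.Text;

class HtmlDecodeDemo
{
    static void Main()
    {
        // Make sure the console can print non-ASCII characters.
        Console.OutputEncoding = Encoding.UTF8;

        // Hypothetical scraped fragments: "å" sent as a named and as a numeric entity.
        string named = "m&aring;";
        string numeric = "m&#229;";

        // HtmlDecode turns both entity forms into the actual character.
        Console.WriteLine(WebUtility.HtmlDecode(named));   // må
        Console.WriteLine(WebUtility.HtmlDecode(numeric)); // må
    }
}
```

If the decoded text still looks wrong, the problem is more likely the byte encoding used when reading the response than the entities.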

I use iso-8859-1 for decoding. HTH
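
To see why the choice of encoding matters, here is a small sketch that decodes the bytes of "må" with a matching and a mismatching encoding. The byte arrays are hand-written examples of what a server might send, not data taken from the site in the question.

```csharp
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // Make sure the console can print non-ASCII characters.
        Console.OutputEncoding = Encoding.UTF8;

        // "må" in ISO-8859-1: 'å' is the single byte 0xE5.
        byte[] latin1Bytes = { 0x6D, 0xE5 };
        // "må" in UTF-8: 'å' is the two-byte sequence 0xC3 0xA5.
        byte[] utf8Bytes = { 0x6D, 0xC3, 0xA5 };

        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Decoder matches the bytes: both lines print "må".
        Console.WriteLine(latin1.GetString(latin1Bytes));
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));

        // Decoder does not match the bytes: the 'å' is lost or garbled.
        Console.WriteLine(Encoding.UTF8.GetString(latin1Bytes)); // 'å' becomes U+FFFD
        Console.WriteLine(latin1.GetString(utf8Bytes));          // prints "mÃ¥"
    }
}
```

A replacement character or a sequence like "Ã¥" in the scraped text is a hint that the reader was created with the wrong encoding for that page.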

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please credit the site URL or the original address.

 