
.NET WebClient encoding not working

I'm trying to parse an HTML document using the .NET WebClient, but the characters I'm getting are not correct. I have tried many encodings, but I can't find out why I'm getting it wrong.

The URL is http://www.vatican.va/archive/ESL0506/__P2.HTM .

This is my code (you can test it in a ConsoleApp)

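    using System.Linq;
    using System.Net;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            WebClient client = new WebClient();
            // Code page 28591 is ISO-8859-1 (Latin-1), the charset the page declares.
            client.Encoding = Encoding.GetEncoding(28591);
            var htmlCode = client.DownloadString("http://www.vatican.va/archive/ESL0506/__P2.HTM");

            // Split on '<' so each fragment starts with a tag name.
            var splittedHtml = htmlCode.Split('<').ToList();

            var htmlVerses = splittedHtml.Where(x => x.StartsWith("p class=MsoNormal align=left")).ToList();
        }
    }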

Then, in htmlVerses I get strings like:

"p class=MsoNormal align=left style='margin-left:0cm;text-align:left;\ntext-indent:0cm'>3 Entonces Dios dijo: &laquo;Que\nexista la luz&raquo;. Y la luz existi&oacute;."

Check this part: 3 Entonces Dios dijo: &laquo;Que\nexista la luz&raquo;. Y la luz existi&oacute;

It's not parsed correctly. It should be: 3 Entonces Dios dijo: «Que exista la luz». Y la luz existió.

If we check the page source in Chrome, we get this:

[screenshot of the page source in Chrome]

Then I tried to get the source code from http://www.generateit.net/seo-tools/source-viewer/ and I'm getting the same anomaly as in my app.

It's really odd: the encoding the web page uses is charset=iso-8859-1, the same one my WebClient uses.

Any help would be appreciated.

The strings you are getting contain HTML character entities such as &laquo; and &oacute;. HTML escapes special characters this way for transmission, so you need to convert them back. Fortunately, .NET provides a method that does this for you:

HttpUtility.HtmlDecode()

See the MSDN documentation for HttpUtility.HtmlDecode.

If you are using .NET 4.5, you can use WebUtility.HtmlDecode() instead. It lives in System.Net, which your code already uses for WebClient (see the MSDN documentation for WebUtility.HtmlDecode).
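
As a minimal sketch of the decoding step, the following applies WebUtility.HtmlDecode() to the verse text from the question. The VerseDecoder class and its Decode method are hypothetical names introduced for illustration. Collapsing the embedded line break into a space is an extra step assumed here to match the expected output in the question; HtmlDecode alone leaves whitespace untouched.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
static class VerseDecoder
{
    public static string Decode(string rawHtmlText)
    {
        // Turn entities such as &laquo; and &oacute; back into characters.
        string decoded = WebUtility.HtmlDecode(rawHtmlText);

        // Assumption: line breaks inside a verse are just source formatting,
        // so collapse any run of whitespace into a single space.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}

class Program
{
    static void Main()
    {
        // Make sure the console can print « » ó if the terminal supports UTF-8.
        Console.OutputEncoding = Encoding.UTF8;

        string raw = "3 Entonces Dios dijo: &laquo;Que\nexista la luz&raquo;. Y la luz existi&oacute;.";
        Console.WriteLine(VerseDecoder.Decode(raw));
        // Prints: 3 Entonces Dios dijo: «Que exista la luz». Y la luz existió.
    }
}
```

In the code from the question, this would be applied to the text portion of each entry in htmlVerses (the part after the closing '>' of the tag), not to the tag attributes.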
