.NET WebClient encoding not working

Question

I'm trying to parse an HTML document using the .NET WebClient, but the characters I'm getting are not correct. I have tried many encodings, but I can't figure out what I'm getting wrong:

The URL is http://www.vatican.va/archive/ESL0506/__P2.HTM .

This is my code (you can test it in a console app):

    static void Main(string[] args)
    {
        WebClient client = new WebClient();
        client.Encoding = Encoding.GetEncoding(28591);
        var htmlCode = client.DownloadString("http://www.vatican.va/archive/ESL0506/__P2.HTM");

        var splittedHtml = htmlCode.Split('<').ToList();

        var htmlVerses = splittedHtml.Where(x => x.StartsWith("p class=MsoNormal align=left")).ToList();
    }

Then, in htmlVerses I get strings like:

"p class=MsoNormal align=left style='margin-left:0cm;text-align:left;\ntext-indent:0cm'>3 Entonces Dios dijo: &laquo;Que\nexista la luz&raquo;. Y la luz existi&oacute;."

Check this part: 3 Entonces Dios dijo: &laquo;Que\nexista la luz&raquo;. Y la luz existi&oacute;.

It's not parsed correctly. It should be: 3 Entonces Dios dijo: «Que exista la luz». Y la luz existió.

If we check the page source in Chrome, we get this:

[image: the page source as shown in Chrome]

Then I tried to get the source code from http://www.generateit.net/seo-tools/source-viewer/ and I got the same anomaly as in my app.

It's really odd: the encoding that the web page uses is charset=iso-8859-1, the same one my WebClient uses.

Any help would be appreciated.

Answer 1

This is not an encoding problem. HTML escapes special characters as entities (such as &laquo; and &oacute;) for transmission, and you need to convert them back. Fortunately, .NET provides methods to automagically do that for you:

HttpUtility.HtmlDecode()

See: MSDN

If you are using .NET 4.5 or later, you can use WebUtility.HtmlDecode() instead, which is already included in System.Net and so does not need a reference to System.Web (see: MSDN).
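
As a rough illustration of the suggestion above, here is a minimal sketch that decodes the fragment quoted in the question. The class name `VerseText` and the method `Decode` are made up for this example and are not part of the original code. Collapsing the line break into a space is an extra step beyond the answer's advice, assumed here only because the expected text in the question has none.

```csharp
using System;
using System.Net;

// Hypothetical helper, not from the original post: turns a raw HTML
// fragment into readable text.
public static class VerseText
{
    public static string Decode(string rawFragment)
    {
        // Convert entities such as &laquo; and &oacute; into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(rawFragment);

        // Assumption: line breaks inside a paragraph are just source wrapping,
        // so replace them with single spaces.
        return decoded.Replace("\r\n", " ").Replace("\n", " ");
    }

    public static void Main()
    {
        // The fragment from the question, after the tag part has been stripped.
        string raw = "3 Entonces Dios dijo: &laquo;Que\nexista la luz&raquo;. Y la luz existi&oacute;.";
        Console.WriteLine(Decode(raw));
    }
}
```

In the question's code, the same call could be applied to each entry of htmlVerses after the download. The WebClient encoding can stay as iso-8859-1, since the entities are plain ASCII and are unaffected by it.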

Answer 1: accepted, 2015-01-28 23:01:30