
ASP.NET Core HtmlAgilityPack Encoding errors

There are some posts about encoding questions and HtmlAgilityPack, but none of them addresses this issue:

Because the website I am trying to parse contains non-ASCII characters such as ä and ü, I tried to set the encoding to Unicode:

using System.Text;
using HtmlAgilityPack;

public class WebpageDeserializer
{
    public WebpageDeserializer() {}

    /*
     * Example address: https://www.dslr-forum.de/showthread.php?t=1930368
     */
    public static void Deserialize(string address)
    {
        var web = new HtmlWeb();
        web.OverrideEncoding = Encoding.Unicode;
        var htmlDoc = web.Load(address);
        // Further parsing fails: the text decoded as Unicode is not valid HTML (it looks more like Chinese).
    }
}

But now

htmlDoc.DocumentNode.InnerHtml

looks like this:

ℼ佄呃偙⁅瑨汭倠䉕䥌⁃ⴢ⼯㍗⽃䐯䑔堠呈䱍ㄠ〮吠慲獮瑩潩慮⽬䔯≎...

If I try UTF-8 or iso-8859-1 instead, the symbol is still not decoded correctly (and neither are ä, ö, ü). How can I fix this?

Your site is misconfigured: the real encoding is cp1252 (Windows-1252).

The code below should work:

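// On .NET Core, code page 1252 is only available after registering the code pages provider.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var client = new HttpClient();
// Download the raw bytes so that they can be decoded explicitly.
var buf = await client.GetByteArrayAsync("https://www.dslr-forum.de/showthread.php?t=1930368");
var html = Encoding.GetEncoding(1252).GetString(buf);
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);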
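
To see why the choice of encoding matters here, the following is a small sketch that decodes the same bytes in several ways. The class `DecodingDemo` and its `Decode` helper are made up for illustration, and the sample bytes are hand-picked rather than taken from the site. It shows that UTF-16 pairs up bytes of ASCII markup into CJK-looking characters, and that ISO-8859-1 and Windows-1252 agree on ä (0xE4) and ü (0xFC) but differ for bytes in the 0x80 to 0x9F range (for example, 0x80 is the euro sign only in Windows-1252).

```csharp
using System;
using System.Text;

public static class DecodingDemo
{
    // Hypothetical helper: decode the given bytes with the named encoding.
    public static string Decode(string encodingName, byte[] bytes)
    {
        // Windows-1252 is not built into .NET Core, so register the code pages provider first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // UTF-16 reads two bytes per code unit: '<' (0x3C) and '!' (0x21) become U+213C.
        byte[] markup = Encoding.ASCII.GetBytes("<!DOCTYPE html");
        Console.WriteLine(Decode("utf-16", markup));

        // Hand-picked single-byte sample: 0xE4, 0xFC, 0x80.
        byte[] sample = { 0xE4, 0xFC, 0x80 };
        foreach (var name in new[] { "utf-8", "iso-8859-1", "windows-1252" })
        {
            Console.Write(name + ":");
            foreach (char c in Decode(name, sample))
            {
                // Print code points so the result does not depend on the console font.
                Console.Write(" U+" + ((int)c).ToString("X4"));
            }
            Console.WriteLine();
        }
    }
}
```

Depending on the target framework, `CodePagesEncodingProvider` may require the System.Text.Encoding.CodePages package.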

Instead of Encoding.Unicode, use:

web.OverrideEncoding = Encoding.GetEncoding("iso-8859-1");

(Tested with your website and German umlauts.)

To find the right encoding, check the header of the target page's HTML. It contains the right hint:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
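
As a rough sketch of how that hint could be read in code rather than by eye, the class below looks for a `charset=` declaration near the top of the downloaded bytes and picks an encoding from it. The names `CharsetSniffer`, `FindDeclaredCharset`, `ResolveEncoding` and `Decode` are hypothetical, as are the 1024-byte scan window and the fallback to Windows-1252. Treating an `ISO-8859-1` label as Windows-1252 mirrors what web browsers do, and is an assumption about what this site needs.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    static CharsetSniffer()
    {
        // Needed on .NET Core so that Encoding.GetEncoding(1252) succeeds.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Returns the charset label declared near the top of the page, or "" if none is found.
    public static string FindDeclaredCharset(byte[] body)
    {
        // The declaration itself is ASCII, so an ASCII decode of the first bytes is enough.
        string head = Encoding.ASCII.GetString(body, 0, Math.Min(body.Length, 1024));
        Match match = Regex.Match(head, @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : "";
    }

    // Maps a charset label to an Encoding (assumed policy: Latin-1 labels and unknowns become 1252).
    public static Encoding ResolveEncoding(string label)
    {
        string name = label.Trim().ToLowerInvariant();
        if (name == "" || name == "iso-8859-1" || name == "latin1")
        {
            return Encoding.GetEncoding(1252);
        }
        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            // Unknown label: fall back to Windows-1252.
            return Encoding.GetEncoding(1252);
        }
    }

    // Decodes the page body using the declared charset.
    public static string Decode(byte[] body)
    {
        return ResolveEncoding(FindDeclaredCharset(body)).GetString(body);
    }
}
```

The string returned by `CharsetSniffer.Decode(buf)` could then be passed to `HtmlDocument.LoadHtml`, with `buf` obtained from `HttpClient.GetByteArrayAsync`.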

