ASP.NET Core HtmlAgilityPack Encoding errors
There are some posts about encoding questions and HtmlAgilityPack, but this issue wasn't addressed:
Because the website I am trying to parse contains Unicode symbols like €, ä, or ü, I tried to set the encoding to Unicode:
public class WebpageDeserializer
{
    public WebpageDeserializer() {}

    /*
     * Example address: https://www.dslr-forum.de/showthread.php?t=1930368
     */
    public static void Deserialize(string address)
    {
        var web = new HtmlWeb();
        web.OverrideEncoding = Encoding.Unicode;
        var htmlDoc = web.Load(address);
        // Further decoding fails because the Unicode-decoded characters are not proper HTML (they look more like Chinese)
    }
}
But now htmlDoc.DocumentNode.InnerHtml looks like this:
ℼ佄呃偙⁅瑨汭倠䉕䥌⁃ⴢ⼯㍗⽃䐯䑔堠呈䱍ㄠ〮吠慲獮瑩潩慮⽬䔯≎...
If I try to use UTF-8 or iso-8859-1, the € symbol is converted to an unreadable character (as are ä, ö, and ü).
How can I fix this?
Your site is misconfigured and the real encoding is cp1252.
The code below should work:
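// On .NET Core, code page 1252 is only available once the code pages provider is registered.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var client = new HttpClient();
var buf = await client.GetByteArrayAsync("https://www.dslr-forum.de/showthread.php?t=1930368");
var html = Encoding.GetEncoding(1252).GetString(buf);
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);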
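To see why the choice of encoding matters here, the following is a minimal sketch that decodes the same bytes three ways. The sample bytes are an assumption (what a cp1252 page would send for "€ ä ü"), not data captured from the site, and depending on the framework version the code pages provider may require the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Text;

class EncodingComparison
{
    static void Main()
    {
        // Makes code page 1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Assumed sample: the cp1252 bytes for "€ ä ü".
        byte[] bytes = { 0x80, 0x20, 0xE4, 0x20, 0xFC };

        // cp1252 maps 0x80 to the euro sign, so all three symbols survive.
        string asCp1252 = Encoding.GetEncoding(1252).GetString(bytes);

        // ISO-8859-1 decodes ä and ü, but 0x80 becomes an invisible control character (U+0080).
        string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(bytes);

        // None of these bytes is valid UTF-8 on its own, so they turn into U+FFFD replacement characters.
        string asUtf8 = Encoding.UTF8.GetString(bytes);

        Console.WriteLine(asCp1252 == "€ ä ü");         // True
        Console.WriteLine(asLatin1 == "\u0080 ä ü");    // True
        Console.WriteLine(asUtf8.Contains('\uFFFD'));   // True
    }
}
```

This would explain the symptoms in the question: UTF-8 loses all three symbols, while ISO-8859-1 keeps the umlauts but not the € sign.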
Instead of Encoding.Unicode, use:
web.OverrideEncoding = Encoding.GetEncoding("iso-8859-1");
(Tested with your website and German umlauts.)
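As for why Encoding.Unicode produces text that looks Chinese: in .NET, Encoding.Unicode is UTF-16 little-endian, so every two bytes of the page are read as a single character. A small sketch, using the start of a typical doctype as an assumed sample input:

```csharp
using System;
using System.Text;

class Utf16Misread
{
    static void Main()
    {
        // Assumed sample: the ASCII bytes at the start of an HTML page.
        byte[] bytes = Encoding.ASCII.GetBytes("<!DOCTYPE ");

        // Each pair of bytes becomes one UTF-16 code unit,
        // e.g. '<' (0x3C) and '!' (0x21) combine to U+213C.
        string misread = Encoding.Unicode.GetString(bytes);

        foreach (char c in misread)
            Console.WriteLine($"U+{(int)c:X4}");
        // Prints U+213C, U+4F44, U+5443, U+5059, U+2045 (one per line)
    }
}
```

Those five code points are the first five characters of the garbled InnerHtml shown in the question.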
To get the right encoding, check the header of the target website. It contains the right hint:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
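Reading that hint can also be done in code. The following is only a sketch under stated assumptions: MetaCharset and Detect are hypothetical names (not part of HtmlAgilityPack), the regular expression is a simplification rather than a full HTML parser, and only the first 4096 bytes are inspected. It also treats the label iso-8859-1 as windows-1252, as browsers do, which may be why € shows up on a page declared as ISO-8859-1.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class MetaCharset
{
    static MetaCharset()
    {
        // Makes windows-1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: picks an encoding from the charset declared
    // in the page's <meta> tag, or returns the fallback if none is found.
    public static Encoding Detect(byte[] html, Encoding fallback)
    {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup stays readable.
        int length = Math.Min(html.Length, 4096);
        string head = Encoding.GetEncoding("iso-8859-1").GetString(html, 0, length);

        Match match = Regex.Match(head, @"charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase);
        if (!match.Success) return fallback;

        string name = match.Groups[1].Value;
        // Browsers treat the label "iso-8859-1" as windows-1252, so mirror that here.
        if (name.Equals("iso-8859-1", StringComparison.OrdinalIgnoreCase)) name = "windows-1252";

        try { return Encoding.GetEncoding(name); }
        catch (ArgumentException) { return fallback; } // unknown charset label
    }
}
```

With the downloaded bytes in buf, one possible use is MetaCharset.Detect(buf, Encoding.UTF8).GetString(buf), passing the resulting string to doc.LoadHtml.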