
Special chars not working with WebClient DownloadString using page's encoding

EDIT: The characters come through correctly at first, but in the middle of the page there's this line: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">. After that line, special characters such as é are written as &eacute; (which browsers render fine), but they come out as eacute; (without the &) when downloaded via WebClient. END EDIT
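Character references such as &eacute; are part of the HTML markup rather than of the byte encoding, so the charset setting alone would not be expected to turn them into é. A minimal sketch of decoding them after the download, assuming the text still contains the leading &; the class and method names here are hypothetical, while WebUtility.HtmlDecode is the standard System.Net call:

```csharp
using System;
using System.Net;

public static class EntityDecoding
{
    // Hypothetical helper: converts HTML character references
    // (e.g. "&eacute;", "&ccedil;", "&iacute;") into the characters they name.
    public static string DecodeEntities(string html)
    {
        return WebUtility.HtmlDecode(html);
    }
}

// Usage sketch:
//   EntityDecoding.DecodeEntities("caf&eacute;")  // "café"
//   EntityDecoding.DecodeEntities("cafeacute;")   // unchanged, because the '&' is missing
```

If the & is already gone by the time the string is inspected, it may be worth checking whether a later step (for example the RegEx extraction) is dropping it, since HtmlDecode cannot recover a reference without it.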

I am extracting an excerpt from a web page using WebClient + RegEx.

But even with the encoding set correctly, é still comes out as eacute;, ç as ccedil;, í as iacute;, etc.

I followed the "DownloadString and Special Characters" example to correctly set the charset (ISO-8859-1):

System.Net.WebClient wc = new System.Net.WebClient();
wc.DownloadString("https://myurl"); // first request, made only to populate ResponseHeaders
var contentType = wc.ResponseHeaders["Content-Type"];
var charset = Regex.Match(contentType, "charset=([^;]+)").Groups[1].Value;
wc.Encoding = Encoding.GetEncoding(charset);

It does set the charset to match the document's (ISO-8859-1), but when I do the follow-up DownloadString (I know I could set the encoding beforehand and do just one wc.DownloadString, but I wanted to follow the accepted answer's example):

string result = wc.DownloadString("https://myurl");
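The single-request variant mentioned above could be sketched roughly as follows: download the raw bytes once, then decode them with the charset named in the Content-Type header. The class and method names are hypothetical, and the fallback to ISO-8859-1 when no charset is present is an assumption made for this page, not a general rule.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetDecoding
{
    // Hypothetical helper: picks the encoding named in a Content-Type value.
    // Falls back to ISO-8859-1 (an assumption) when no charset parameter is found.
    // Encoding.GetEncoding throws ArgumentException for names it does not know.
    public static Encoding EncodingFromContentType(string contentType)
    {
        Match match = Regex.Match(contentType ?? "",
            @"charset=\s*""?([^;""\s]+)", RegexOptions.IgnoreCase);
        string name = match.Success ? match.Groups[1].Value : "ISO-8859-1";
        return Encoding.GetEncoding(name);
    }

    // Decodes an already-downloaded response body using the header's charset.
    public static string Decode(byte[] body, string contentType)
    {
        return EncodingFromContentType(contentType).GetString(body);
    }
}

// Usage sketch with the WebClient from the question (one request only):
//   byte[] raw = wc.DownloadData("https://myurl");
//   string result = CharsetDecoding.Decode(raw, wc.ResponseHeaders["Content-Type"]);
```

Decoding from bytes avoids the throwaway first download, and keeping the helper free of network calls makes it easy to check against known byte sequences.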

The special characters still come out wrong.

NOTE: I am using a non-English Windows 10 (in case it's relevant).

NOTE 2: The page's special characters appear correctly in any browser.

My question is: why doesn't WebClient download the page correctly even with the correct charset set?

using System.Text;

wc.Encoding = Encoding.UTF8;
