Special chars not working with WebClient DownloadString using page's encoding
EDIT: The characters arrive correctly, but in the middle of the page there is this line:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">
After it, special characters such as é are written as HTML entities like &eacute; (which browsers render fine), but they come out as eacute; (without the &) when downloaded via WebClient. END EDIT
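The entities described in the edit can be turned back into characters after the download. Here is a minimal sketch, assuming the page uses standard HTML named entities and that decoding the extracted excerpt is acceptable. `EntityDemo` and `DecodeExcerpt` are hypothetical names introduced for illustration; `WebUtility.HtmlDecode` is part of `System.Net`.

```csharp
using System.Net;

public static class EntityDemo
{
    // Converts HTML entities such as &eacute; or &ccedil; back into characters.
    // Text that has already lost its leading '&' (e.g. "eacute;") is not a valid
    // entity, so it is returned unchanged.
    public static string DecodeExcerpt(string html)
    {
        return WebUtility.HtmlDecode(html);
    }
}
```

For example, `EntityDemo.DecodeExcerpt("caf&eacute;")` returns `café`, while `EntityDemo.DecodeExcerpt("eacute;")` returns the input untouched. If the `&` is missing from the downloaded text, decoding alone will not help, and it may be worth checking whether a later step (such as the RegEx) drops it.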
I am extracting an excerpt from a web page using WebClient + RegEx.
But even with the encoding set correctly, é still comes out as eacute;, ç as ccedil;, í as iacute;, etc.
I followed the "DownloadString and Special Characters" example to set the charset (ISO-8859-1) correctly:
System.Net.WebClient wc = new System.Net.WebClient();
wc.DownloadString("https://myurl"); // first request, made only to obtain the response headers
var contentType = wc.ResponseHeaders["Content-Type"];
var charset = Regex.Match(contentType, "charset=([^;]+)").Groups[1].Value;
wc.Encoding = Encoding.GetEncoding(charset);
It does set charset to match the document's (ISO-8859-1), but when I do the follow-up DownloadString (I know I could set the encoding beforehand and do just one wc.DownloadString, but I wanted to follow the accepted answer's example):
string result = wc.DownloadString("https://myurl");
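The single-request variant mentioned above could be sketched by downloading the raw bytes once and decoding them with the charset named in the `Content-Type` header. This is only a sketch under assumptions: `CharsetDemo`, `DecodeBody`, and the ISO-8859-1 fallback are hypothetical choices made for illustration, not part of the original code.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetDemo
{
    // Decodes a response body using the charset named in a Content-Type header value.
    // Falls back to ISO-8859-1 (an assumption for this page) when no charset is given.
    public static string DecodeBody(byte[] body, string contentType)
    {
        Match m = Regex.Match(contentType ?? "", @"charset=([^;]+)", RegexOptions.IgnoreCase);
        // Trim whitespace and optional quotes around the charset name.
        string charset = m.Success ? m.Groups[1].Value.Trim().Trim('"') : "ISO-8859-1";
        return Encoding.GetEncoding(charset).GetString(body);
    }
}
```

With a `WebClient`, this would be used as `byte[] raw = wc.DownloadData("https://myurl");` followed by `string result = CharsetDemo.DecodeBody(raw, wc.ResponseHeaders["Content-Type"]);`, so the page is fetched only once.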
The special characters still come out wrong.
NOTE: I am using a non-English Windows 10 (in case it's relevant).
NOTE 2: The page's special characters appear correctly in any browser.
My question is: why doesn't WebClient download the page correctly even with the correct charset set?
using System.Text;
wc.Encoding = Encoding.UTF8;