WebClient.DownloadString() returns string with peculiar characters
I have an issue with some content that we are downloading from the web for a screen scraping tool that I am building.
In the code below, the string returned from the WebClient DownloadString method contains some odd characters in the downloaded source for a few (not all) web sites.
I have recently added HTTP headers as below. Previously the same code was called without the headers, to the same effect. I have not tried variations on the 'Accept-Charset' header; I don't know much about text encoding other than the basics.
The characters, or character sequences, that I refer to are:
"ï»¿"
and
"Â"
These characters are not seen when you use "view source" in a web browser. What could be causing this and how can I rectify the problem?
string urlData = String.Empty;
WebClient wc = new WebClient();
// Add headers to impersonate a web browser. Some web sites
// will not respond correctly without these headers
wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
wc.Headers.Add("Accept", "*/*");
wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
urlData = wc.DownloadString(uri);

ï»¿ is the windows-1252 representation of the octets EF BB BF. That's the UTF-8 byte-order marker, which implies that your remote web page is encoded in UTF-8 but you're reading it as if it were windows-1252. According to the docs, WebClient.DownloadString uses WebClient.Encoding as its encoding when it converts the remote resource into a string. Set it to System.Text.Encoding.UTF8 and things should theoretically work.
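To see the mismatch concretely, here is a minimal sketch that is not part of the original answer. It decodes bytes with ISO-8859-1, used here as a stand-in for windows-1252 because the two agree on the byte values involved and ISO-8859-1 needs no extra code-page registration on newer .NET runtimes. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MisdecodingDemo
{
    // Decode bytes the "wrong" way: as a single-byte Western encoding.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    }

    // Decode the same bytes as UTF-8.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Passing the bytes EF BB BF to DecodeAsLatin1 yields the three characters ï»¿, while DecodeAsUtf8 yields the single character U+FEFF. Similarly, the UTF-8 bytes C2 A0 (a non-breaking space) come out of DecodeAsLatin1 as "Â" followed by a non-breaking space, which is a plausible source of the stray "Â", although that is an assumption about the pages involved.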
The way WebClient.DownloadString is implemented is very dumb. It should get the character encoding from the Content-Type header in the response, but instead it expects the developer to tell it the expected encoding beforehand. I don't know what the developers of this class were thinking.
I have created an auxiliary class that retrieves the encoding name from the Content-Type header of the response:
using System;
using System.Collections.Specialized;
using System.Linq;
using System.Text;

public static class WebUtils
{
    public static Encoding GetEncodingFrom(
        NameValueCollection responseHeaders,
        Encoding defaultEncoding = null)
    {
        if (responseHeaders == null)
            throw new ArgumentNullException("responseHeaders");

        // Note that key lookup is case-insensitive
        var contentType = responseHeaders["Content-Type"];
        if (contentType == null)
            return defaultEncoding;

        // Content-Type looks like "text/html; charset=utf-8"
        var contentTypeParts = contentType.Split(';');
        if (contentTypeParts.Length <= 1)
            return defaultEncoding;

        var charsetPart =
            contentTypeParts.Skip(1).FirstOrDefault(
                p => p.TrimStart().StartsWith("charset", StringComparison.InvariantCultureIgnoreCase));
        if (charsetPart == null)
            return defaultEncoding;

        var charsetPartParts = charsetPart.Split('=');
        if (charsetPartParts.Length != 2)
            return defaultEncoding;

        // The value may be quoted, e.g. charset="utf-8", so strip the quotes too
        var charsetName = charsetPartParts[1].Trim().Trim('"');
        if (charsetName == "")
            return defaultEncoding;

        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch (ArgumentException ex)
        {
            throw new UnknownEncodingException(
                charsetName,
                "The server returned data in an unknown encoding: " + charsetName,
                ex);
        }
    }
}
(UnknownEncodingException is a custom exception class; feel free to replace it with InvalidOperationException or whatever else you want.)
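The answer does not show UnknownEncodingException itself. A minimal sketch that fits the constructor call above might look like the following; the CharsetName property is an assumption added for illustration, not something taken from the original.

```csharp
using System;

// Hypothetical definition; the original answer does not include this class.
public class UnknownEncodingException : Exception
{
    // Name of the charset the server reported (assumed property).
    public string CharsetName { get; private set; }

    public UnknownEncodingException(string charsetName, string message, Exception innerException)
        : base(message, innerException)
    {
        CharsetName = charsetName;
    }
}
```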
Then the following extension method for the WebClient class will do the trick:
public static class WebClientExtensions
{
    public static string DownloadStringAwareOfEncoding(this WebClient webClient, Uri uri)
    {
        // Download the raw bytes, then decode them with the encoding the server declared
        var rawData = webClient.DownloadData(uri);
        var encoding = WebUtils.GetEncodingFrom(webClient.ResponseHeaders, Encoding.UTF8);
        return encoding.GetString(rawData);
    }
}
So in your example you would do:
urlData = wc.DownloadStringAwareOfEncoding(uri);
...and that's it.
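As a possible alternative to splitting the header by hand, the framework's System.Net.Mime.ContentType class can parse a Content-Type value and expose its charset parameter. The sketch below is an assumption about how one might use it, not what the answer above does; ContentTypeCharset and Resolve are hypothetical names. Unlike GetEncodingFrom, it quietly falls back instead of throwing when the charset is unknown.

```csharp
using System;
using System.Net.Mime;
using System.Text;

// Hypothetical helper, for illustration only.
public static class ContentTypeCharset
{
    // Returns the encoding named by the charset parameter, or the fallback
    // when the header is missing, malformed, or names an unknown charset.
    public static Encoding Resolve(string contentTypeHeader, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
            return fallback;

        try
        {
            var charset = new ContentType(contentTypeHeader).CharSet;
            return string.IsNullOrEmpty(charset)
                ? fallback
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return fallback; // header could not be parsed
        }
        catch (ArgumentException)
        {
            return fallback; // charset name not recognized
        }
    }
}
```

It could be called as ContentTypeCharset.Resolve(webClient.ResponseHeaders["Content-Type"], Encoding.UTF8) after DownloadData has returned.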
var client = new WebClient { Encoding = System.Text.Encoding.UTF8 };
var json = client.DownloadString(url);
In my case, I deleted every header related to language, charset, etc., except User-Agent and Cookie. It worked.
// Try commenting out these headers:
//wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
//wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
None of these worked for me for some special websites such as "www.yahoo.com". The only way I could resolve my problem was changing DownloadString to OpenRead and using the UserAgent header, as in the sample code. However, a few sites like "www.varzesh3.com" didn't work with any of the methods!
WebClient client = new WebClient();
client.Headers.Add(HttpRequestHeader.UserAgent, "");
// Read the response as a stream instead of letting WebClient decode it
var stream = client.OpenRead("http://www.yahoo.com");
StreamReader sr = new StreamReader(stream);
string s = sr.ReadToEnd();
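One plausible reason the OpenRead variant behaves differently (the post does not explain it, so treat this as an assumption) is that StreamReader, when constructed from just a stream, decodes as UTF-8 and consumes a leading byte-order mark. A small sketch with an in-memory stream standing in for the network response, using a made-up class name:

```csharp
using System.IO;

// Hypothetical helper, for illustration only.
public static class StreamReaderBomDemo
{
    // Reads all text from the bytes using StreamReader's defaults:
    // UTF-8 decoding with byte-order-mark detection enabled.
    public static string ReadAll(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Feeding it the bytes EF BB BF 48 69 returns "Hi" with no stray characters in front.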
You guys rock! I've been trying to Invoke-WebRequest an AWS URL with an embedded security token. I haven't been able to download it for the life of me. After adding the Accept and User-Agent headers shown in this thread, I was finally able to download the JSON objects. Thank you!