WebClient.DownloadString() returns string with peculiar characters

Question

I have an issue with some content that we are downloading from the web for a screen-scraping tool that I am building.

In the code below, the string returned by the WebClient.DownloadString method contains some odd characters in the downloaded source for a few (not all) web sites.

I have recently added the HTTP headers shown below. Previously the same code was called without the headers, with the same result. I have not tried variations on the 'Accept-Charset' header; I don't know much about text encoding beyond the basics.

The characters, or character sequences, that I refer to are:

"ï»¿"

and

"Â"

These characters are not seen when you use "view source" in a web browser. What could be causing this, and how can I rectify the problem?

string urlData = String.Empty;
WebClient wc = new WebClient();

// Add headers to impersonate a web browser. Some web sites 
// will not respond correctly without these headers
wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
wc.Headers.Add("Accept", "*/*");
wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

urlData = wc.DownloadString(uri);

Answer 1

ï»¿ is the windows-1252 representation of the octets EF BB BF. That's the UTF-8 byte-order mark, which implies that your remote web page is encoded in UTF-8 but you're reading it as if it were windows-1252. According to the docs, WebClient.DownloadString uses WebClient.Encoding as its encoding when it converts the remote resource into a string. Set it to System.Text.Encoding.UTF8 and things should theoretically work.
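To see the effect described above, here is a minimal sketch (the class and method names are made up for illustration). It encodes text as UTF-8 and then deliberately decodes the bytes with a single-byte encoding. ISO-8859-1 stands in for windows-1252 here, on the assumption that you want something that works without registering extra code pages; the two encodings agree on the bytes involved.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MojibakeDemo
{
    // Encodes text as UTF-8 (optionally preceded by the UTF-8 BOM) and then
    // decodes the resulting bytes with the wrong, single-byte encoding.
    public static string DecodeUtf8AsLatin1(string text, bool includeBom)
    {
        byte[] body = Encoding.UTF8.GetBytes(text);

        // GetPreamble() returns the BOM bytes EF BB BF for Encoding.UTF8.
        byte[] bom = includeBom ? Encoding.UTF8.GetPreamble() : new byte[0];

        byte[] bytes = bom.Concat(body).ToArray();

        // Assumption: ISO-8859-1 is used in place of windows-1252.
        return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    }
}
```

DecodeUtf8AsLatin1("", true) yields ï»¿. Passing a non-breaking space ("\u00A0") with includeBom set to false yields Â followed by a non-breaking space, which is probably where the stray Â in the question comes from.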

Answer 2

The way WebClient.DownloadString is implemented is very dumb. It should get the character encoding from the Content-Type header in the response, but instead it expects the developer to specify the expected encoding beforehand. I don't know what the developers of this class were thinking.

I have created an auxiliary class that retrieves the encoding name from the Content-Type header of the response:

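using System;
using System.Collections.Specialized;
using System.Linq;
using System.Net;
using System.Text;

public static class WebUtils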
{
    public static Encoding GetEncodingFrom(
        NameValueCollection responseHeaders,
        Encoding defaultEncoding = null)
    {
        if(responseHeaders == null)
            throw new ArgumentNullException("responseHeaders");

        //Note that key lookup is case-insensitive
        var contentType = responseHeaders["Content-Type"];
        if(contentType == null)
            return defaultEncoding;

        var contentTypeParts = contentType.Split(';');
        if(contentTypeParts.Length <= 1)
            return defaultEncoding;

        var charsetPart =
            contentTypeParts.Skip(1).FirstOrDefault(
                p => p.TrimStart().StartsWith("charset", StringComparison.InvariantCultureIgnoreCase));
        if(charsetPart == null)
            return defaultEncoding;

        var charsetPartParts = charsetPart.Split('=');
        if(charsetPartParts.Length != 2)
            return defaultEncoding;

        // Some servers quote the value, e.g. charset="utf-8", so strip the quotes too
        var charsetName = charsetPartParts[1].Trim().Trim('"');
        if(charsetName == "")
            return defaultEncoding;

        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch(ArgumentException ex) 
        {
            throw new UnknownEncodingException(
                charsetName,   
                "The server returned data in an unknown encoding: " + charsetName, 
                ex);
        }
    }
}

(UnknownEncodingException is a custom exception class; feel free to replace it with InvalidOperationException or whatever else you want.)

Then the following extension method for the WebClient class will do the trick:

public static class WebClientExtensions
{
    public static string DownloadStringAwareOfEncoding(this WebClient webClient, Uri uri)
    {
        var rawData = webClient.DownloadData(uri);
        var encoding = WebUtils.GetEncodingFrom(webClient.ResponseHeaders, Encoding.UTF8);
        return encoding.GetString(rawData);
    }
}

So in your example you would do:

urlData = wc.DownloadStringAwareOfEncoding(uri);

...and that's it.
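If you would rather not split the header by hand, a possible alternative is to let the framework's System.Net.Mime.ContentType class parse it. This is only a sketch: the class and method names below are made up, and unlike the helper above it silently returns the fallback for a malformed header or an unknown charset instead of throwing.

```csharp
using System;
using System.Net.Mime;
using System.Text;

// Hypothetical helper, for illustration only.
public static class ContentTypeCharset
{
    // Returns the encoding named in a Content-Type header value,
    // or the fallback if no usable charset is present.
    public static Encoding FromHeader(string contentTypeHeader, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
            return fallback;

        try
        {
            // ContentType parses "type/subtype; charset=..." for us.
            string charset = new ContentType(contentTypeHeader).CharSet;
            return string.IsNullOrEmpty(charset)
                ? fallback
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return fallback; // the header could not be parsed
        }
        catch (ArgumentException)
        {
            return fallback; // the charset name is not a known encoding
        }
    }
}
```

With WebClient you would call it as ContentTypeCharset.FromHeader(webClient.ResponseHeaders["Content-Type"], Encoding.UTF8) after DownloadData, then decode the bytes with the result.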

Answer 3

var client = new WebClient { Encoding = System.Text.Encoding.UTF8 };

var json = client.DownloadString(url);

Answer 4

In my case, I deleted every header related to language, charset, etc., except User-Agent and Cookie. It worked.

 // Try commenting out these headers
 //wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
 //wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

Answer 5

None of the above worked for me for some websites such as "www.yahoo.com". The only way I resolved my problem was changing DownloadString to OpenRead and setting the User-Agent header, as in the sample code below. However, a few sites such as "www.varzesh3.com" did not work with any of these methods!

WebClient client = new WebClient();
client.Headers.Add(HttpRequestHeader.UserAgent, "");
var stream = client.OpenRead("http://www.yahoo.com");
StreamReader sr = new StreamReader(stream);
string s = sr.ReadToEnd();
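A likely reason this helps: a StreamReader created from just a Stream decodes as UTF-8 by default and detects and skips a leading byte-order mark, regardless of WebClient.Encoding. The sketch below shows that behaviour on an in-memory stream instead of a real response (the class and method names are made up for illustration).

```csharp
using System.IO;

// Hypothetical helper, for illustration only.
public static class StreamReaderDemo
{
    // Reads the bytes the way the snippet above reads the response stream:
    // StreamReader(Stream) defaults to UTF-8 and consumes a leading BOM.
    public static string ReadAll(byte[] responseBytes)
    {
        using (var stream = new MemoryStream(responseBytes))
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Feeding it the bytes EF BB BF 68 69 returns "hi" with no stray characters. Note that StreamReader does not look at the Content-Type header, so a page served in some other encoding without a byte-order mark would still come out garbled.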

Answer 6

You guys rock! I've been trying to use Invoke-WebRequest on an AWS URL with an embedded security token. I couldn't download it for the life of me. After adding the Accept and User-Agent headers shown in this thread, I was finally able to download the JSON objects. Thank you!


Answer 1: score 99, accepted, 2011-01-17 18:34:10
Answer 2: score 48, 2015-05-05 09:59:57
Answer 3: score 12, 2016-11-18 06:02:42
Answer 4: score 0, 2017-01-06 19:52:41
Answer 5: score 0, 2017-11-27 09:06:02
Answer 6: score 0, 2022-01-07 15:16:06
