
Decoding extended characters in XML

I know this is probably simple and has probably been asked before, but I'm having trouble coming up with a solution.

I am parsing some RSS feeds that include HTML in CDATA blocks. One example is here: http://g.msn.com/1ewenus50/news2

The feed changes a lot, but there are almost always some extended characters in it. For example, if I make a simple console app, call WebClient.DownloadString, and look at the result, I see things like:

"learned of the alleged attempted Flight 253 bomber’s extremist links while he was mid-flight on Christmas Day. NBC’s Savannah Guthrie reports. (Today Show)"

However, those weird characters should be apostrophes, quote marks, em dashes, etc.

What is the trick for getting these to decode correctly?

If it wasn't clear, I'm using C# / .NET for this. In the end this content will be rendered in Silverlight, but I'm seeing the issue in the full .NET 3.5 runtime as well.

Download it in binary form and parse it as XML. That should get it right: the XML document should be self-describing in terms of its encoding, but I wouldn't put it past some web servers to advertise it (in headers) as having a different encoding, which would confuse DownloadString.

In general, when XML is involved, it's worth doing as much as possible within an XML API rather than with the raw data.
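
A minimal sketch of that suggestion, not a definitive implementation: it fetches the raw bytes with WebClient.DownloadData and hands them to an XmlReader, which works out the encoding from the byte order mark or the XML declaration. The helper name ParseFeed is made up for illustration, and the code assumes the feed is plain RSS 2.0 with un-namespaced description elements and that the URL from the question still responds.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;
using System.Xml.Linq;

class FeedReader
{
    // Hypothetical helper: parses raw bytes as XML and lets the XML reader
    // determine the encoding from the BOM or the XML declaration.
    static XDocument ParseFeed(byte[] raw)
    {
        using (var stream = new MemoryStream(raw))
        using (XmlReader reader = XmlReader.Create(stream))
        {
            return XDocument.Load(reader);
        }
    }

    static void Main()
    {
        // URL taken from the question; it may no longer be available.
        const string url = "http://g.msn.com/1ewenus50/news2";

        byte[] raw;
        using (var client = new WebClient())
        {
            // DownloadData returns bytes, so WebClient does no text decoding.
            raw = client.DownloadData(url);
        }

        XDocument feed = ParseFeed(raw);

        // Assumption: RSS 2.0 items carry their HTML in <description>.
        // XElement.Value returns the text content, including CDATA sections.
        foreach (XElement description in feed.Descendants("description"))
        {
            Console.WriteLine(description.Value);
        }
    }
}
```

XmlReader.Create(Stream) followed by XDocument.Load(XmlReader) is used here because both are available in .NET 3.5, the runtime mentioned in the question.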

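You are probably using the wrong text encoding. I'm not sure which one you're using or which one is correct, but this might put you on the right track.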
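
To illustrate what a wrong encoding does, here is a small sketch. It assumes, as a guess consistent with the garbled text in the question, that the feed is UTF-8 and was decoded as Windows-1252. The class and variable names are invented for the example.

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // U+2019 is the right single quotation mark used as an apostrophe.
        string original = "bomber\u2019s";

        // In UTF-8 that character is the three bytes E2 80 99.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

        // Assumption: the wrong decoder was Windows-1252. On .NET Framework
        // this encoding is built in; on .NET Core and later the code pages
        // provider must be registered before GetEncoding(1252) works.
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // Each of the three bytes becomes its own character: â, € and ™.
        string wrong = windows1252.GetString(utf8Bytes);

        // Decoding with the encoding the bytes were written in restores the text.
        string right = Encoding.UTF8.GetString(utf8Bytes);

        Console.WriteLine(wrong == "bomber\u00E2\u20AC\u2122s");
        Console.WriteLine(right == original);
    }
}
```

If the feed really is UTF-8, setting the WebClient.Encoding property to Encoding.UTF8 before calling DownloadString is one commonly used alternative, though parsing the bytes with an XML API avoids having to guess the encoding at all.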
