
Reading XML with an "&" into a C# XmlDocument Object

I have inherited a poorly written web application that seems to have errors when it tries to read in an XML document stored in the database that has an "&" in it. For example, there will be a tag with the contents "Prepaid & Charge". Is there some secret simple thing to do to have it not get an error parsing that character, or am I missing something obvious?

EDIT: Are there any other characters that will cause this same type of parser error for not being well-formed?

The problem is that the XML is not well-formed. Properly generated XML would list that data like this:

Prepaid &amp; Charge

I've had to fix the same problem before, and I did it with this regex:

Regex badAmpersand = new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");

Combine that with a string constant defined like this:

const string goodAmpersand = "&amp;";

Now you can just say badAmpersand.Replace(<your input>, goodAmpersand);

Note that a simple String.Replace("&", "&amp;") isn't good enough, since you can't know in advance for a given document whether any & characters will be coded correctly, incorrectly, or even both in the same document.

The catches here are that you have to do this to your XML document before loading it into your parser, which likely means an extra pass through it. Also, it does not account for ampersands inside of a CDATA section. Finally, it only catches ampersands, not other illegal characters like <. Update: based on the comment, I need to update the expression for hex-coded (&#x...;) entities as well.
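As a rough illustration of the approach above, here is one way the pre-processing pass might look once hex-coded references are covered too. This is only a sketch: the class and method names (AmpersandRepair, EscapeBareAmpersands, LoadRepaired) are made up for the example, and the length limits in the pattern are arbitrary assumptions, not values taken from any specification.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper; names and length limits are assumptions for illustration.
public static class AmpersandRepair
{
    // Matches an '&' that does NOT begin a named (&amp;), decimal (&#38;)
    // or hex (&#x26;) reference.
    private static readonly Regex BadAmpersand = new Regex(
        "&(?![a-zA-Z][a-zA-Z0-9]{1,31};|#[0-9]{1,7};|#x[0-9a-fA-F]{1,6};)",
        RegexOptions.Compiled);

    private const string GoodAmpersand = "&amp;";

    // Escapes only the bare ampersands, leaving existing references untouched.
    public static string EscapeBareAmpersands(string rawXml)
    {
        return BadAmpersand.Replace(rawXml, GoodAmpersand);
    }

    // Runs the repair pass and then hands the result to the parser.
    public static XmlDocument LoadRepaired(string rawXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(EscapeBareAmpersands(rawXml));
        return doc;
    }
}
```

The same caveats still apply: ampersands inside CDATA sections would be rewritten as well, and a named reference that merely looks valid (for example &nbsp; with no DTD declaring it) passes through this filter but can still be rejected by the parser.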

Regarding which characters can cause problems, the actual rules are a little complex. For example, certain characters are allowed in data, but not as the first letter of an element name. And there's no simple list of illegal characters. Instead, a large (non-contiguous) swath of Unicode is defined as legal, and anything outside of that is illegal.
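If you want to probe these rules rather than memorize them, .NET exposes some of them through XmlConvert. The sketch below assumes a hypothetical helper class (XmlCharCheck) and is only meant to show the two kinds of rule mentioned above: characters that are illegal anywhere, and characters that are fine in data but not at the start of a name. Note that & and < are legal characters in this sense; they just have to be escaped.

```csharp
using System.Xml;

// Hypothetical helper for illustration only.
public static class XmlCharCheck
{
    // Returns the index of the first character XML 1.0 does not allow,
    // or -1 if every character is legal.
    public static int IndexOfIllegalChar(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            // A valid surrogate pair counts as one legal character.
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && XmlConvert.IsXmlSurrogatePair(text[i + 1], c))
            {
                i++; // skip the low surrogate
                continue;
            }
            if (!XmlConvert.IsXmlChar(c))
            {
                return i;
            }
        }
        return -1;
    }

    // True if the string can be used as an element or attribute name.
    public static bool IsValidElementName(string name)
    {
        try
        {
            XmlConvert.VerifyName(name);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

For instance, a digit is fine inside text or later in a name, but a name such as "1item" is rejected because it starts with one.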

So when it comes down to it, you have to trust your document source to have at least a certain amount of compliance and consistency. For example, I've found that people are often smart enough to make sure the tags work properly and escape <, even if they don't know that & isn't allowed, hence your problem today. However, the best thing would be to get this fixed at the source.

Oh, and a note about the CDATA suggestion: I'd use that to make sure XML that I'm creating is well-formed, but when dealing with existing XML from outside, I find the regex method easier.

The web application isn't at fault, the XML document is. Ampersands in XML should be encoded as &amp;. Failure to do so is a syntax error.

Edit: in answer to the followup question, yes, there are all kinds of similar errors. For example, unbalanced tags, unencoded less-than signs, unquoted attribute values, octets outside of the character encoding and various Unicode oddities, unrecognised entity references, and so on. In order to get any decent XML parser to consume a document, that document must be well-formed. The XML specification requires that a parser encountering a malformed document throw a fatal error.

The other answers are all correct, and I concur with their advice, but let me just add one thing:

PLEASE do not make applications that work with non-well-formed XML; it just makes the rest of our lives more difficult :).

Granted, there are times when you really don't have a choice because you have no control over the other end, but even then you should have the application throw a fatal error and complain loudly and explicitly about what is broken when such an event occurs.

You could probably take it one step further and say "Ack! This XML is broken in these places and for these reasons, here's how I tried to fix it to make it well-formed: ...".

I'm not overly familiar with the MSXML APIs, but most good XML parsers will allow you to install error handlers so that you can trap the exact line/column number where errors are appearing, along with getting the error code and message.
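In .NET specifically (rather than MSXML), the parser reports this information through XmlException, which carries LineNumber and LinePosition properties. A minimal sketch of "complain loudly" might look like the following; the class name StrictXmlLoader and the message format are assumptions made for the example.

```csharp
using System.Xml;

// Hypothetical helper for illustration only.
public static class StrictXmlLoader
{
    // Returns an empty string if the XML parses, otherwise a description
    // of the first well-formedness error with its line and column.
    public static string DescribeError(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return string.Empty;
        }
        catch (XmlException ex)
        {
            return string.Format(
                "line {0}, column {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
        }
    }
}
```

Logging that string alongside the record's database key would give whoever owns the upstream process enough detail to find and fix the offending document.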

Your database doesn't contain XML documents. It contains some well-formed XML documents and some strings that look like XML to a human.

If it's at all possible, you should fix this. In particular, you should fix whatever process is generating the malformed XML documents. Fixing the program that reads data out of this database is just putting wallpaper over a crack in the wall.

You can replace & with &amp;.

Or you might also be able to use CDATA sections.

There are several characters which will cause XML data to be reported as badly formed.

From w3schools:

Characters like "<" and "&" are illegal in XML elements.

The best solution for input you can't trust to be XML-compliant is to wrap it in CDATA tags, e.g.

<![CDATA[This is my wonderful & great user text]]>

Everything between <![CDATA[ and ]]> is treated as literal character data by the parser rather than being parsed as markup.
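When you are the one generating the XML, it is usually simpler to let the DOM do the wrapping or escaping for you instead of concatenating strings. The sketch below uses XmlDocument.CreateCDataSection and XmlDocument.CreateTextNode; the class and method names (SafeXmlBuilder, WithCData, WithTextNode) are hypothetical.

```csharp
using System.Xml;

// Hypothetical helper for illustration only.
public static class SafeXmlBuilder
{
    // Wraps the text in a CDATA section inside the given element.
    public static string WithCData(string elementName, string userText)
    {
        var doc = new XmlDocument();
        XmlElement element = doc.CreateElement(elementName);
        element.AppendChild(doc.CreateCDataSection(userText));
        return element.OuterXml;
    }

    // Adds the text as a normal text node; the serializer escapes & and <.
    public static string WithTextNode(string elementName, string userText)
    {
        var doc = new XmlDocument();
        XmlElement element = doc.CreateElement(elementName);
        element.AppendChild(doc.CreateTextNode(userText));
        return element.OuterXml;
    }
}
```

Either form parses back to the same text. One thing to watch with CDATA is user text that itself contains ]]>, since that sequence ends the section; the text-node route does not have that problem.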
