XML: How to pre-parse when only some of the data is escaped?

Question

XML snippet:

<field>&amp; is escaped</field>
<field>&quot;also escaped&quot;</field>
<field>is & "not" escaped</field>
<field>is &quot; and is not & escaped</field>

I'm looking for suggestions on how to pre-parse arbitrary XML so that everything not already escaped gets escaped before the XML is run through a parser.

I do not have control over the XML being passed to me, the supplier likely won't fix it anytime soon, and I have to find a way to parse it.

The primary issue is that feeding the XML as-is into a parser, such as the one below, throws an exception because some of the content is not escaped properly and the XML is therefore not well-formed:

string xml = "<field>& is not escaped</field>";
var reader = XmlReader.Create(new StringReader(xml));
while (reader.Read()) { } // throws XmlException at the unescaped "&"

Answer 1

I'd suggest you use a Regex to replace unescaped ampersands with their entity equivalent.

This question is helpful, as it gives you a Regex to find these rogue ampersands:

&(?!(?:apos|quot|[gl]t|amp);|#)

You can see that it matches the correct text in this demo. You can use it in a simple replace operation:

var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&amp;");

And then you'll be able to parse your XML.
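Putting the replace and the parse together, a minimal sketch might look like the following. The class name `AmpersandFixer` and its method names are made up for illustration, and parsing with `XDocument` instead of `XmlReader` is an assumption, not something the answer prescribes. The pattern is the one quoted above, so it inherits that pattern's behaviour: any `&` followed by `#` is left alone, whether or not a valid character reference follows.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, named for illustration only.
public static class AmpersandFixer
{
    // Pattern from the answer: an "&" that does not start one of the five
    // predefined entities and is not followed by "#".
    private static readonly Regex BareAmpersand =
        new Regex("&(?!(?:apos|quot|[gl]t|amp);|#)");

    // Replaces every bare "&" with "&amp;" and leaves existing escapes untouched.
    public static string Escape(string xml) =>
        BareAmpersand.Replace(xml, "&amp;");

    // Escapes first, then hands the repaired text to a regular XML parser.
    public static XDocument Parse(string xml) =>
        XDocument.Parse(Escape(xml));
}
```

Calling `AmpersandFixer.Parse` on the four `<field>` elements from the question, wrapped in a single root element (XML allows only one), should then succeed, with the bare ampersands read back as literal `&` characters in the element values.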

Answer 2

Use HTML Tidy with quote-ampersand set to true to preprocess the text data (it isn't XML).

Answer 3

If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it. Once you've written a grammar for the non-XML language that you intend to process, you can decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.

For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like &nbsp; when there's no definition of &nbsp;, then life starts to become rather more difficult.
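As one illustration of the "don't corrupt comments or CDATA" case, here is a rough sketch that matches whole comments and CDATA sections first and copies them through unchanged, escaping only the bare ampersands found elsewhere. The class name `CommentAwareAmpersandFixer` is hypothetical, and the sketch assumes comments and CDATA sections are themselves correctly terminated. It does not address undefined named entities: something like `&nbsp;` would simply become `&amp;nbsp;`.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class CommentAwareAmpersandFixer
{
    // Alternatives are tried left to right at each position, so a comment or
    // CDATA section is consumed as one match before any "&" inside it is seen.
    private static readonly Regex Token = new Regex(
        @"<!--.*?-->|<!\[CDATA\[.*?\]\]>|&(?!(?:apos|quot|[gl]t|amp);|#)",
        RegexOptions.Singleline);

    // Escapes bare ampersands; comments and CDATA sections are returned as-is.
    public static string Escape(string xml) =>
        Token.Replace(xml, m => m.Value == "&" ? "&amp;" : m.Value);
}
```

The design choice here is to let a single regex do the tokenizing, which keeps the preprocessing to one pass. Anything more irregular than this, such as malformed comments or unescaped `<` characters, is where a purpose-built parser becomes the safer option.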

Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.

3 answers

Answer 1: score 3, accepted, 2017-06-19 15:44:26
Answer 2: score 0, 2017-06-19 15:47:33
Answer 3: score 0, 2017-06-19 15:47:39
