XML: how to pre-parse when only SOME data is escaped?

XML snippet:

<field>&amp; is escaped</field>
<field>&quot;also escaped&quot;</field>
<field>is & "not" escaped</field>
<field>is &quot; and is not & escaped</field>

I'm looking for suggestions on how to pre-process arbitrary XML so that everything that is not already escaped gets escaped before I run the XML through a parser.

I do not have control over the XML being passed to me, the sender likely won't fix it anytime soon, and I have to find a way to parse it.

The primary issue is that feeding the XML as-is to a parser, as in the snippet below, throws an exception, because some of the content is not properly escaped and the XML is therefore not well-formed:

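// requires: using System.IO; using System.Xml;
string xml = "<field>& is not escaped</field>";
using var reader = XmlReader.Create(new StringReader(xml));
while (reader.Read()) { } // XmlException is thrown here, when the bare '&' is read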

I'd suggest you use a Regex to replace unescaped ampersands with their entity equivalent.

This question is helpful, as it gives you a Regex to find these rogue ampersands:

&(?!(?:apos|quot|[gl]t|amp);|#)

You can see that it matches the correct text in this demo. You can use it in a simple replace operation:

var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&amp;");

And then you'll be able to parse your XML.
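To show the replace step and the parse step together, here is a minimal sketch. The class and method names (AmpersandFixer, EscapeBareAmpersands, ParseLenient) are made up for illustration and are not part of the original answer. The pattern is also an assumption on my part: it is a slightly stricter variant of the one above that leaves "&#" alone only when it starts a complete decimal or hexadecimal character reference.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, not from the original answer.
public static class AmpersandFixer
{
    // A bare '&' is one that does NOT start one of the five predefined
    // entities or a syntactically complete numeric character reference.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:apos|quot|[gl]t|amp);|#[0-9]+;|#x[0-9a-fA-F]+;)",
        RegexOptions.Compiled);

    // Replace every bare '&' with "&amp;" and leave existing escapes alone.
    public static string EscapeBareAmpersands(string xml) =>
        BareAmpersand.Replace(xml, "&amp;");

    // Escape first, then hand the repaired text to a normal XML parser.
    public static XElement ParseLenient(string xml) =>
        XElement.Parse(EscapeBareAmpersands(xml));
}
```

Calling AmpersandFixer.ParseLenient on the third field from the question returns an element whose Value is the text with the ampersand and quotes intact. Note that this only repairs ampersands; an unescaped "<" inside text would still break the parser.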

Pre-process the text data (it is not XML) with HTML Tidy, with quote-ampersand set to true.

If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it. Once you've written a grammar for the non-XML language that you intend to process, you can decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.

For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like &nbsp; when there's no definition of &nbsp;, then life starts to become rather more difficult.
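As a rough illustration of the "don't corrupt comments or CDATA" case, one possible approach is a single pattern whose first alternative swallows whole comments and CDATA sections, so that only ampersands outside them are rewritten. This is a sketch under assumptions, not a full grammar: the class name CarefulAmpersandFixer is hypothetical, processing instructions and DOCTYPE declarations are not treated specially, and an undeclared named entity such as &nbsp; is simply treated as a bare ampersand.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, shown only to illustrate the idea.
public static class CarefulAmpersandFixer
{
    // Alternatives are tried left to right at each position, so a complete
    // comment or CDATA section is consumed whole (group "skip") before any
    // '&' inside it can be matched by the second alternative.
    private static readonly Regex Token = new Regex(
        @"(?<skip><!--.*?-->|<!\[CDATA\[.*?\]\]>)" +
        @"|&(?!(?:apos|quot|[gl]t|amp);|#[0-9]+;|#x[0-9a-fA-F]+;)",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Comments and CDATA sections are copied through unchanged;
    // every other match is a bare '&' and becomes "&amp;".
    public static string Escape(string xml) =>
        Token.Replace(xml, m => m.Groups["skip"].Success ? m.Value : "&amp;");
}
```

With this sketch, "&nbsp;" outside a comment or CDATA section becomes "&amp;nbsp;", which parses but yields the literal text "&nbsp;" rather than a non-breaking space. Whether that is acceptable is exactly the kind of decision the grammar has to make.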

Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.
