
Difference between XmlDocument and XDocument handling of &#x0;

I have been trying to load an XML file that contains some null ('&#x0;') characters. I have tried

  XmlDocument document = new XmlDocument();
  document.LoadXml(xmlString);

and

XDocument.Load(stringReader);

The XmlDocument.LoadXml() method loads the XML document successfully, whereas the XDocument.Load() method throws an XmlException for the same XML string.

Sample code to reproduce:

        string xmlFile = @"C:\dummyData.xml";

        // Read the whole file, including the &#x0; character references, as text
        string xmlString = File.ReadAllText(xmlFile);

        XmlDocument document = new XmlDocument();
        document.LoadXml(xmlString); // Works

        XDocument.Parse(xmlString); // Throws XmlException

        using (StringReader reader = new StringReader(xmlString))
        {
            XDocument.Load(reader); // Throws XmlException as well
        }

XML file

Copy the content of the XML file from here

A character reference such as &#x0; is not allowed in XML (at least not in XML 1.0, which is what Microsoft supports). For legacy support, however, I think an XmlTextReader, or an XmlReader created with XmlReaderSettings set to not check characters, can load such markup. XmlDocument uses such an XmlReader while XDocument does not.
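As a rough sketch of the reader configuration described above (not necessarily what XmlDocument does internally), an XmlReader created with CheckCharacters set to false can be handed to XDocument.Load. The class name LenientXmlLoader and the helper ParseLenient are made up for this illustration; XmlReaderSettings.CheckCharacters, XmlReader.Create and XDocument.Load are the real APIs.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class LenientXmlLoader
{
    // Hypothetical helper: parses XML without validating character references.
    public static XDocument ParseLenient(string xml)
    {
        // CheckCharacters = false turns off the check on character references such as &#x0;
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (var stringReader = new StringReader(xml))
        using (var xmlReader = XmlReader.Create(stringReader, settings))
        {
            return XDocument.Load(xmlReader);
        }
    }

    public static void Main()
    {
        string xml = "<Drawings>&#x0;&#x0;</Drawings>";

        try
        {
            XDocument.Parse(xml); // default settings check characters
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Strict parse failed: " + ex.Message);
        }

        XDocument doc = ParseLenient(xml);
        // The element text now contains the raw null characters.
        Console.WriteLine(doc.Root.Value.Length);
    }
}
```

This only gets the document loaded. The null characters are still not valid XML, so writing the document back out with default writer settings may fail for the same reason.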

Why &#x0; is a problem

As defined by the W3C, character references are

CharRef ::=   '&#' [0-9]+ ';'
            | '&#x' [0-9a-fA-F]+ ';'

So at first sight, a character reference like &#x0; looks good.

But you need to read the definition:

[Definition: A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.]

So the character reference needs to point to an ISO/IEC 10646 character, and the specification constrains this further:

Characters referred to using character references MUST match the production for Char.

Luckily, Char is in the same document and is defined as:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
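To make the production concrete, here is a minimal sketch that transcribes it into a range check on a Unicode code point. The names XmlCharCheck and IsXml10Char are invented for this example and are not part of any library.

```csharp
using System;

public static class XmlCharCheck
{
    // Hypothetical helper: true if the code point matches the XML 1.0 Char production.
    public static bool IsXml10Char(int codePoint)
    {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void Main()
    {
        Console.WriteLine(IsXml10Char(0x0));  // False: the null character is outside every range
        Console.WriteLine(IsXml10Char(0x41)); // True: 'A' falls in #x20-#xD7FF
    }
}
```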

Therefore, as mentioned by Martin Honnen before, neither the C# \0 character nor the escaped versions &#0; or &#x0; are allowed in XML documents.

XML parser reality

Some parsers might ignore parts of the above rules and not fully adhere to the standard.

The real origin of the problem

The XML you posted seems to contain images/drawings:

<?xml version="1.0" encoding="utf-8"?>
<TestData>
        <Images>
            <Drawings>
&lt;?xml version="1.0"?&gt;
&lt;ArrayOfMarkerState &gt;
&lt;/ArrayOfMarkerState&gt;
&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;&#x0;</Drawings>
        </Images>
            <Date>2015-10-20T17:19:05.2656609+05:30</Date>
</TestData>

The nature of pixel graphics is that they contain binary data.

Developers who are not familiar with XML and are faced with the problem of embedding binary data in XML will quickly think that any byte can be encoded as &#x00; ... &#xFF;.

Unfortunately, this is plain wrong. Why? Well, because of the W3C definition above.

Other than that, it is a bad idea with regard to size as well. Even if it worked, a byte encoded like this would take 6 bytes in XML.

Solving the original problem

Binary data cannot go into XML documents as character references, so let's find something that works and needs less than a 500% increase in size.

The answer is Base64. Base64 increases the size by roughly 33%.

Encoding the 47 &#x0; bytes would result in

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

which is only 64 bytes long, compared to the original 235 bytes.
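The numbers above can be reproduced with a short sketch. It assumes the drawing payload is just 47 zero bytes, as in the posted sample; Base64SizeDemo and EncodeZeros are hypothetical names, while Convert.ToBase64String and Convert.FromBase64String are the standard .NET methods.

```csharp
using System;

public static class Base64SizeDemo
{
    // Hypothetical helper: Base64-encodes `count` zero bytes.
    public static string EncodeZeros(int count)
    {
        return Convert.ToBase64String(new byte[count]);
    }

    public static void Main()
    {
        string encoded = EncodeZeros(47);
        Console.WriteLine(encoded);
        Console.WriteLine(encoded.Length);      // 64 characters
        Console.WriteLine(47 * "&#x0;".Length); // 235 characters as character references

        // Decoding restores the original 47 bytes.
        byte[] decoded = Convert.FromBase64String(encoded);
        Console.WriteLine(decoded.Length);      // 47
    }
}
```

The Base64 text contains only letters, digits, '+', '/' and '=', so it can be placed in an element such as Drawings without any escaping.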
