
Parsing an XML string to an XML document fails if the string begins with the <?xml… ?> section

I have an XML file beginning like this:

<?xml version="1.0" encoding="utf-8"?>
<Report xmlns:rd="http://schemas.microsoft.com/SQLServer/reporting/reportdesigner" xmlns="http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition">
  <DataSources>

When I run the following code:

byte[] fileContent = //gets bytes
string stringContent = Encoding.UTF8.GetString(fileContent);
XDocument xml = XDocument.Parse(stringContent);

I get the following XmlException:

Data at the root level is invalid. Line 1, position 1.

Cutting out the version and encoding node fixes the problem. Why? How do I process this XML correctly?

My first thought was that the encoding is Unicode when parsing XML from a .NET string type. It seems, though, that XDocument's parsing is quite forgiving with respect to this.

The problem is actually related to the UTF-8 preamble/byte order mark (BOM), which is a three-byte signature optionally present at the start of a UTF-8 stream. These three bytes are a hint as to the encoding being used in the stream.

You can determine the preamble of an encoding by calling the GetPreamble method on an instance of the System.Text.Encoding class. For example:
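To make the BOM explanation concrete, here is a minimal sketch of what happens in the question's code. The class and method names (BomRepro, DecodedTextStartsWithBom, CanParse) are made up for illustration and are not part of any library. Encoding.UTF8.GetString does not drop the BOM; it decodes the three bytes to the single character U+FEFF, which then sits in front of the `<?xml` declaration in the string handed to XDocument.Parse.

```csharp
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper class, for illustration only.
public static class BomRepro
{
    // True when the decoded text begins with U+FEFF,
    // the character that the UTF-8 BOM bytes EF BB BF decode to.
    public static bool DecodedTextStartsWithBom(byte[] bytes)
    {
        string text = Encoding.UTF8.GetString(bytes);
        return text.Length > 0 && text[0] == '\uFEFF';
    }

    // True when XDocument.Parse accepts the text,
    // false when it throws an XmlException.
    public static bool CanParse(string text)
    {
        try
        {
            XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Calling DecodedTextStartsWithBom on the file's bytes is a quick way to confirm whether this is the cause in your case.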

// returns { 0xEF, 0xBB, 0xBF }
byte[] preamble = Encoding.UTF8.GetPreamble();
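Building on GetPreamble, a small check like the following can tell you whether a byte array actually starts with the UTF-8 BOM. This is only a sketch; PreambleCheck and StartsWithUtf8Preamble are hypothetical names chosen for this example.

```csharp
using System.Text;

// Hypothetical helper class, for illustration only.
public static class PreambleCheck
{
    // Compares the first bytes of the data with the UTF-8 preamble (EF BB BF).
    public static bool StartsWithUtf8Preamble(byte[] data)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        if (data == null || data.Length < preamble.Length)
        {
            return false;
        }
        for (int i = 0; i < preamble.Length; i++)
        {
            if (data[i] != preamble[i])
            {
                return false;
            }
        }
        return true;
    }
}
```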

The preamble should be handled correctly by XmlTextReader, so simply load your XDocument from an XmlTextReader:

XDocument xml;
using (var xmlStream = new MemoryStream(fileContent))
using (var xmlReader = new XmlTextReader(xmlStream))
{
    xml = XDocument.Load(xmlReader);
}

If you only have bytes you could either load the bytes into a stream:

XmlDocument oXML;

using (MemoryStream oStream = new MemoryStream(oBytes))
{
  oXML = new XmlDocument();
  oXML.Load(oStream);
}

Or you could convert the bytes into a string (presuming that you know the encoding) before loading the XML. Note that Encoding.UTF8.GetString keeps a leading BOM as the character U+FEFF, so this route only works when the bytes carry no BOM:

string sXml;
XmlDocument oXml;

sXml = Encoding.UTF8.GetString(oBytes);
oXml = new XmlDocument();
oXml.LoadXml(sXml);

I've shown my example as .NET 2.0 compatible; if you're using .NET 3.5 you can use XDocument instead of XmlDocument.

Load the bytes into a stream:

XDocument oXML;

using (MemoryStream oStream = new MemoryStream(oBytes))
using (XmlTextReader oReader = new XmlTextReader(oStream))
{
  oXML = XDocument.Load(oReader);
}

Convert the bytes into a string:

string sXml;
XDocument oXml;

sXml = Encoding.UTF8.GetString(oBytes);
oXml = XDocument.Parse(sXml);
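If you do want to go through a string and the bytes might start with a BOM, one option is to decode with a StreamReader instead of Encoding.UTF8.GetString, since StreamReader consumes the BOM rather than returning it as a character. The sketch below assumes UTF-8 as the fallback encoding; BomSafeDecoding, DecodeText and ParseBytes are hypothetical names for this example.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper class, for illustration only.
public static class BomSafeDecoding
{
    // Decodes the bytes to text. The third constructor argument asks
    // StreamReader to detect the encoding from a byte order mark;
    // UTF-8 is used when no BOM is present (an assumption of this sketch).
    public static string DecodeText(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    // Parses the decoded text into an XDocument.
    public static XDocument ParseBytes(byte[] bytes)
    {
        return XDocument.Parse(DecodeText(bytes));
    }
}
```

Loading straight from the stream, as in the previous snippet, avoids the intermediate string altogether, so this variant is mainly useful when you need the text for something else as well.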

Do you have a byte-order-mark (BOM) at the beginning of your XML, and does it match your encoding? If you chop out your header, you'll also chop out the BOM, and if that is incorrect, then subsequent parsing may work.

You may need to inspect your document at the byte level to see the BOM.

Why bother reading the file as a byte sequence and then converting it to a string when it is an XML file? Just let the framework do the loading for you and cope with the encodings:

var xml = XDocument.Load("test.xml");

Try this:

int startIndex = xmlString.IndexOf('<');
if (startIndex > 0)
{
    xmlString = xmlString.Remove(0, startIndex);
}

I have also encountered this error because the source XML was a string that had somehow picked up some non-printable characters that seemed to break XmlDocument or XDocument parsing. Stripping them fixed the issue:

string sanitized = Regex.Replace(part, @"\p{C}+", string.Empty);

Credit: C# regex to remove non-printable characters, and control characters, in a text that has a mix of many different languages, unicode letters
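One caveat with the regex above: the \p{C} class also covers ordinary control characters such as tabs and line breaks, so it changes whitespace inside the document as well. If the only offender is a BOM at the start of the string, a narrower sketch like this may be enough. BomTrim and StripLeadingBom are hypothetical names for this example.

```csharp
// Hypothetical helper class, for illustration only.
public static class BomTrim
{
    // Removes U+FEFF characters from the start of the text
    // and leaves everything else, including line breaks, untouched.
    public static string StripLeadingBom(string text)
    {
        return string.IsNullOrEmpty(text) ? text : text.TrimStart('\uFEFF');
    }
}
```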
