What is the set of valid first characters in an XML document?

I'm working on some code to determine the character encoding of an XML document being returned by a web server (an RSS feed in this particular case). Unfortunately, sometimes the web server lies and tells me that the document is UTF-8 when in fact it's not, or the boilerplate XML generation code on the server has <?xml encoding='UTF-8'?> at the start but the document contains invalid UTF-8 byte sequences.

Since I don't have control over the server, I need to make my client code tolerate this kind of inconsistency and show something, even if some of the characters are not decoded correctly. This is an important requirement for my application.

I'm well aware that the server is violating the XML spec in this case. I try to work with the server-side developers when possible to make things correct according to the spec, but sometimes this is a low priority for them or for their organization, or the server-side code is not actively maintained by anyone.

In order to be robust, I want to look at the first few bytes of the XML data and try to determine if it's some form of UTF-16 or some 8-bit encoding. I already have code that looks for a byte order mark (BOM).

But sometimes the server doesn't include a BOM, even for UTF-16. I want to try to figure out whether it's UTF-16 or not by looking at the first two bytes and checking them against the list of possible first characters in an XML document.

Obviously I have to draw the line somewhere. If the document is not well-formed XML I won't be able to parse it anyway unless I write my own very tolerant parser (which I'm not planning to do). But given that it's well-formed, what could I possibly see in the first character of the document aside from a BOM?

So far as I can tell from looking at the spec, this set would be: whitespace (space, tab, new line, carriage return) and '<'. Do any XML experts out there know of anything I might be missing? I need to assume that the <?xml?> declaration may not be present even if required by the spec.

Internal DTDs, processing instructions, tags and comments all start with '<'. Is it possible to have an entity (starting with '&') or something else at the start of a document?
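For illustration, the two-byte check described above could be sketched in Python roughly like this. The name guess_family and the returned labels are made up for this sketch. It assumes BOM detection has already run, that the candidate set really is just whitespace and '<', and it ignores UTF-32 entirely.

```python
# Candidate first characters of a well-formed XML document (BOM aside),
# as assumed in the question: tab, newline, carriage return, space, '<'.
XML_FIRST_CHARS = b"\t\n\r <"


def guess_family(data: bytes) -> str:
    """Guess the encoding family of BOM-less XML from its first two bytes."""
    if len(data) < 2:
        return "unknown"
    b0, b1 = data[0], data[1]
    # In UTF-16 each of these ASCII characters is one zero byte plus the
    # ASCII byte; the position of the zero byte gives the byte order.
    if b0 == 0 and b1 in XML_FIRST_CHARS:
        return "utf-16-be"
    if b0 in XML_FIRST_CHARS and b1 == 0:
        return "utf-16-le"
    # A legal first character followed by a non-zero byte: some 8-bit
    # (ASCII-compatible) encoding; which one cannot be told from here.
    if b0 in XML_FIRST_CHARS:
        return "8-bit"
    return "unknown"
```

Anything that comes back as "unknown" would still need a fallback of some kind.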

EDIT: Rewritten to emphasize my particular requirements.

The XML Specification provides some guidance about detecting character encodings. The problem is that it is nearly impossible to look at the first few bytes and tell whether the data is UTF-8, ISO-8859-1, or CP437 for that matter. The information that the spec contains will at least let you distinguish well-formed documents.
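As a rough sketch of that guidance: the byte patterns below are the ones listed in the non-normative Appendix F of the XML 1.0 spec ("Autodetection of Character Encodings"). The function name sniff and the label strings are invented for this example. Note that it only recognizes BOM-less documents that start with "<?", which is exactly the limitation for feeds that omit the declaration.

```python
# Byte patterns from XML 1.0 Appendix F. Longer patterns come first so
# that a UTF-32 BOM is not mistaken for a UTF-16 BOM.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be (BOM)"),
    (b"\xff\xfe\x00\x00", "utf-32-le (BOM)"),
    (b"\xef\xbb\xbf", "utf-8 (BOM)"),
    (b"\xfe\xff", "utf-16-be (BOM)"),
    (b"\xff\xfe", "utf-16-le (BOM)"),
    (b"\x00\x00\x00\x3c", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),   # "<?" as big-endian UTF-16
    (b"\x3c\x00\x3f\x00", "utf-16-le"),   # "<?" as little-endian UTF-16
    (b"\x3c\x3f\x78\x6d", "ascii-compatible"),  # "<?xm": read the declaration
    (b"\x4c\x6f\xa7\x94", "ebcdic"),      # "<?xm" in EBCDIC
]


def sniff(data: bytes) -> str:
    """Return a label for the first matching signature, else "unknown"."""
    for prefix, label in SIGNATURES:
        if data.startswith(prefix):
            return label
    # Appendix F treats this case as UTF-8 without an encoding declaration,
    # or as mislabeled or corrupt data; this sketch just reports "unknown".
    return "unknown"
```

For the "ascii-compatible" case the actual encoding still has to be read from the encoding declaration, and, as noted above, that declaration may itself be wrong.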

The trouble is that if a feed is invalid, it probably doesn't obey any rules about legal characters. Take a look at the code for the Universal Feed Parser. It's very well-tested code for parsing garbage text into possibly-correct data structures.

The UFP uses a sub-library named Universal Encoding Detector, which should contain useful information for general encoding detection.

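It's not ideal, but I sometimes do this when I need to cope with bad encodings (shown here as a small Python sketch, where data holds the raw bytes of the response):

try:
    text = data.decode("utf-8")
except UnicodeDecodeError:
    # Not valid UTF-8: assume Windows-1252. errors="replace" covers the
    # few byte values that cp1252 leaves undefined.
    text = data.decode("cp1252", errors="replace")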

That is, try to interpret the input as UTF-8, and if that fails, treat it as coming from a Windows system (which it probably is). It seems like a reasonable compromise to me.

Of course, this does require that you download the entire input into memory first, which may not be practical.
