What is the set of valid first characters in an XML document?
I'm working on some code to determine the character encoding of an XML document being returned by a web server (an RSS feed in this particular case). Unfortunately, sometimes the web server lies and tells me that the document is UTF-8 when in fact it's not, or the boilerplate XML generation code on the server has
<?xml encoding='UTF-8'?>
at the start but the document contains invalid UTF-8 byte sequences.
Since I don't have control over the server, I need to make my client code tolerate this kind of inconsistency and show something, even if some of the characters are not decoded correctly. This is an important requirement for my application.
I'm well aware that the server is violating the XML spec in this case. I try to work with the server-side developers when possible to make things correct according to the spec, but sometimes this is a low priority for them or for their organization, or the server-side code is not actively maintained by anyone.
In order to be robust, I want to look at the first few bytes of the XML data and try to determine if it's some form of UTF-16 or some 8-bit encoding. I already have code that looks for a byte order mark (BOM).
But sometimes the server doesn't include a BOM, even for UTF-16. I want to try to figure out whether it's UTF-16 by looking at the first two bytes and checking them against the list of possible first characters in an XML document.
Obviously I have to draw the line somewhere. If the document is not well-formed XML I won't be able to parse it anyway unless I write my own very tolerant parser (which I'm not planning to do). But given that it's well-formed, what could I possibly see as the first character of the document aside from a BOM?
As far as I can tell from looking at the spec, this set would be: whitespace (space, tab, new line, carriage return) and '<'. Do any XML experts out there know of anything I might be missing? I need to assume that the
<?xml?>
declaration may not be present even if required by the spec.
Internal DTDs, processing instructions, tags and comments all start with '<'. Is it possible to have an entity (starting with '&') or something else at the start of a document?
EDIT: Rewritten to emphasize my particular requirements.
The XML Specification provides some guidance about detecting character encodings. The problem is that it is nearly impossible to look at the first few bytes and tell whether the document is UTF-8, ISO-8859-1, or CP437 for that matter. The information that the spec contains will at least let you identify the encoding family of well-formed documents.
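As a rough illustration of the kind of check that guidance allows, here is a minimal Python sketch. It assumes the document is well-formed, so the first character after any BOM is '<' or XML whitespace, and it only separates UTF-16/UTF-32 from "some 8-bit encoding". The function name sniff_xml_encoding and the "8-bit" return value are made up for this example and are not part of any library.

```python
import codecs

# Characters that may begin a well-formed XML document (after any BOM):
# '<' or XML whitespace (space, tab, LF, CR). Assumption: input is well-formed.
FIRST_CHARS = b"< \t\n\r"


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML document.

    Hypothetical helper: returns a Python codec name, or "8-bit" when the
    bytes look like UTF-8, ISO-8859-1, CP1252, or another single-byte family.
    """
    # 4-byte BOMs go first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    boms = [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name

    head = data[:4]
    # No BOM: look for the zero-byte padding around the first ASCII character.
    if len(head) == 4:
        if head[:3] == b"\x00\x00\x00" and head[3] in FIRST_CHARS:
            return "utf-32-be"
        if head[1:] == b"\x00\x00\x00" and head[0] in FIRST_CHARS:
            return "utf-32-le"
    if len(head) >= 2:
        if head[0] == 0 and head[1] in FIRST_CHARS:
            return "utf-16-be"
        if head[1] == 0 and head[0] in FIRST_CHARS:
            return "utf-16-le"
    return "8-bit"
```

For example, sniff_xml_encoding("<?xml".encode("utf-16-le")) gives "utf-16-le", while plain b"<?xml" gives "8-bit". Telling the 8-bit encodings apart still needs the encoding declaration or a heuristic over the rest of the data.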
The trouble is that if a feed is invalid, it probably doesn't obey any rules about legal characters. Take a look at the code for the Universal Feed Parser. It's very well-tested code for parsing garbage text into possibly-correct data structures.
The UFP uses a sub-library named Universal Encoding Detector, which should contain useful information for general encoding detection.
It's not ideal, but I sometimes do this when I need to cope with bad encodings (pseudo-code alert).
str = decode("utf-8", input)
if (!str) {
str = decode("cp1252", input)
}
That is, try to interpret the input as UTF-8, and if it fails, treat it as coming from a Windows system (which it probably is). It seems like a reasonable compromise to me.
Of course, this does require that you download the entire input into memory first, which may not be practical.
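For what it's worth, the pseudo-code above might look like this in Python. This is only a sketch: decode_lenient is a made-up name, and the errors="replace" on the fallback is my own addition, because Python's cp1252 codec leaves a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined and would otherwise raise on them.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise fall back to Windows-1252.

    Hypothetical helper illustrating the try-UTF-8-then-cp1252 compromise.
    """
    try:
        # Strict decode: any invalid UTF-8 sequence raises UnicodeDecodeError.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback: bytes that cp1252 does not define become U+FFFD,
        # so the caller always gets some text to show.
        return raw.decode("cp1252", errors="replace")
```

Valid UTF-8 passes through unchanged, and input such as b"caf\xe9" comes back as "café" via the fallback, so the user always sees something.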