How do I parse an XML file that contains a BOM?

Question

I want to parse an XML file from a URL using JDOM. But when trying this:

SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);

I get this exception:

Invalid byte 1 of 1-byte UTF-8 sequence.

I thought this might be a BOM issue, so I checked the source and saw a BOM at the beginning of the file. I tried reading from the URL using aUrl.openStream() and removing the BOM with Commons IO BOMInputStream, but to my surprise it didn't detect any BOM. I then tried reading from the stream, writing to a local file, and parsing the local file. I set all the encodings for InputStreamReader and OutputStreamWriter to UTF-8, but when I opened the file it contained garbage characters.

I thought the problem was with the encoding of the source URL. But when I open the URL in a browser, save the XML to a file, and read that file through the process described above, everything works fine.

I would appreciate any help identifying the possible cause of this issue.

Answer 1

That HTTP server is sending the content in gzipped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. For example:

builder.build(new GZIPInputStream(aUrl.openStream()));
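A quick way to check this diagnosis is to look at the first bytes of the raw response. A gzip stream starts with the magic number 0x1F 0x8B, while a UTF-8 BOM is 0xEF 0xBB 0xBF, which would explain why a BOM detector sees nothing on the compressed stream. The sketch below is only an illustration: the class and method names (LeadingBytes, classify, gzip) are hypothetical, and it compresses a string in memory instead of contacting the server from the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class LeadingBytes {
    /** Classifies the first bytes of a response body (hypothetical helper). */
    static String classify(byte[] head) {
        // gzip magic number: 0x1F 0x8B
        if (head.length >= 2
                && (head[0] & 0xFF) == 0x1F
                && (head[1] & 0xFF) == 0x8B) {
            return "gzip";
        }
        // UTF-8 byte order mark: 0xEF 0xBB 0xBF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "utf-8-bom";
        }
        return "other";
    }

    /** Gzips a string in memory so the demo needs no network access. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // U+FEFF encodes to the three BOM bytes in UTF-8.
        String xmlWithBom = "\uFEFF<root/>";
        System.out.println(classify(xmlWithBom.getBytes(StandardCharsets.UTF_8)));
        System.out.println(classify(gzip(xmlWithBom)));
    }
}
```

If the bytes read from aUrl.openStream() classify as gzip, the BOM only becomes visible after decompression.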

Edited to add, based on the follow-up comment: If you don't know in advance whether the response will be gzipped, you can write something like this:

private InputStream openStream(final URL url) throws IOException
{
    final URLConnection cxn = url.openConnection();
    final String contentEncoding = cxn.getContentEncoding();
    if(contentEncoding == null)
        return cxn.getInputStream();
    else if(contentEncoding.equalsIgnoreCase("gzip")
               || contentEncoding.equalsIgnoreCase("x-gzip"))
        return new GZIPInputStream(cxn.getInputStream());
    else
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
}

(warning: not tested) and then use:

builder.build(openStream(aUrl));

This is basically equivalent to the above (aUrl.openStream() is explicitly documented to be a shorthand for aUrl.openConnection().getInputStream()), except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.

See the documentation for java.net.URLConnection.
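If the Content-Encoding header turns out to be missing or unreliable, a different technique from the header check above is to sniff the gzip magic number (0x1F 0x8B) on the stream itself. The following is a minimal sketch of that variant, not part of the original answer; GzipSniffer, maybeGunzip and readUtf8 are hypothetical names, and the demo uses in-memory data in place of a real URL.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSniffer {
    /** Wraps the stream in a GZIPInputStream only if it starts with the gzip magic number. */
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];
        int n = 0;
        // Read up to two bytes; a single read() may return fewer than requested.
        while (n < 2) {
            int r = in.read(head, n, 2 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        // Push the bytes back so the caller still sees the whole stream.
        if (n > 0) {
            in.unread(head, 0, n);
        }
        boolean gzipped = n == 2
                && (head[0] & 0xFF) == 0x1F
                && (head[1] & 0xFF) == 0x8B;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    /** Reads a stream to the end and decodes it as UTF-8 (demo helper). */
    static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        for (int r; (r = in.read(chunk)) != -1; ) {
            out.write(chunk, 0, r);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "<root/>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(plain);
        }
        // Both calls yield the same XML text, compressed input or not.
        System.out.println(readUtf8(maybeGunzip(new ByteArrayInputStream(buffer.toByteArray()))));
        System.out.println(readUtf8(maybeGunzip(new ByteArrayInputStream(plain))));
    }
}
```

With a real URL this would presumably be used as builder.build(GzipSniffer.maybeGunzip(aUrl.openStream())). If the BOM still needs to be stripped, BOMInputStream can be layered on top of the returned stream.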

Answer 2

You might find you can avoid handling encoded responses by sending a blank Accept-Encoding header. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html: "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding." That seems to be occurring here.
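As a rough sketch of that suggestion, the header can be set on the URLConnection before any data is read. PlainRequest and openWithAcceptEncoding are hypothetical names, and the URL is a placeholder that is never contacted. Passing "" gives the blank header suggested above; "identity" is the RFC 2616 content-coding that means no transformation. Whether a particular server honours either value is an assumption to verify, so keeping the gzip handling as a fallback seems prudent.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class PlainRequest {
    /** Opens a connection with an explicit Accept-Encoding request header (hypothetical helper). */
    static URLConnection openWithAcceptEncoding(URL url, String acceptEncoding) throws IOException {
        // openConnection() only creates the connection object; no request is sent yet.
        URLConnection cxn = url.openConnection();
        // Request headers must be set before the connection is actually used.
        cxn.setRequestProperty("Accept-Encoding", acceptEncoding);
        return cxn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL for illustration; nothing is fetched in this demo.
        URL url = new URL("http://example.invalid/feed.xml");
        URLConnection cxn = openWithAcceptEncoding(url, "identity");
        System.out.println(cxn.getRequestProperty("Accept-Encoding"));
    }
}
```

The stream for parsing would then come from cxn.getInputStream() rather than aUrl.openStream().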

Answer 1: score 4, accepted, 2011-12-12 22:29:14

Answer 2: score 0, 2011-12-12 23:26:04
