
How to parse an XML file containing BOM?

I want to parse an XML file from a URL using JDOM. But when I try this:

SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);

I get this exception:

Invalid byte 1 of 1-byte UTF-8 sequence.

I thought this might be a BOM issue, so I checked the source and saw a BOM at the beginning of the file. I tried reading from the URL using aUrl.openStream() and removing the BOM with the Commons IO BOMInputStream, but to my surprise it didn't detect any BOM. I then tried reading from the stream, writing to a local file, and parsing the local file. I set the encoding for both InputStreamReader and OutputStreamWriter to UTF-8, but when I opened the file it contained garbage characters.

I thought the problem was with the encoding of the source URL. But when I open the URL in a browser, save the XML to a file, and read that file through the process described above, everything works fine.

I would appreciate any help identifying the possible cause of this issue.

That HTTP server is sending the content in gzipped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. This also explains why BOMInputStream found no BOM: the first bytes of the raw stream are the gzip header, not the BOM. For example:

builder.build(new GZIPInputStream(aUrl.openStream()));
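To confirm that the body really is gzip-compressed rather than merely BOM-prefixed, one option is to peek at the first two bytes of the stream: gzip data starts with the magic bytes 0x1f 0x8b, while a UTF-8 BOM is EF BB BF. The following is only a sketch; the class name GzipSniffer and its methods are hypothetical helpers, not part of JDOM or Commons IO.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper (not from the original post): detects gzip data by
// its two magic bytes instead of relying on the Content-Encoding header.
final class GzipSniffer {

    // Returns true if the stream starts with 0x1f 0x8b. The stream must
    // support mark/reset so the peeked bytes can be pushed back.
    static boolean looksLikeGzip(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(2);
        int b0 = in.read();
        int b1 = in.read();
        in.reset();
        return b0 == 0x1f && b1 == 0x8b;
    }

    // Wraps the raw stream in a GZIPInputStream only when the magic bytes
    // are present; otherwise returns the buffered stream unchanged.
    static InputStream decompressIfGzipped(InputStream raw) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(raw);
        return looksLikeGzip(buffered) ? new GZIPInputStream(buffered) : buffered;
    }
}
```

With this in place, the call would look like builder.build(GzipSniffer.decompressIfGzipped(aUrl.openStream())). Sniffing the bytes is a fallback for servers that compress without declaring it; checking the Content-Encoding header remains the more conventional approach.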

Edited to add, based on the follow-up comment: If you don't know in advance whether the response will be gzipped, you can write something like this:

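import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

// Opens the URL and decompresses the body if the server gzipped it.
private InputStream openStream(final URL url) throws IOException
{
    final URLConnection cxn = url.openConnection();
    // null means the server sent no Content-Encoding header
    final String contentEncoding = cxn.getContentEncoding();
    if(contentEncoding == null)
        return cxn.getInputStream();
    else if(contentEncoding.equalsIgnoreCase("gzip")
               || contentEncoding.equalsIgnoreCase("x-gzip"))
        return new GZIPInputStream(cxn.getInputStream());
    else
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
}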

(warning: not tested) and then use:

builder.build(openStream(aUrl));

This is basically equivalent to the above (aUrl.openStream() is explicitly documented to be shorthand for aUrl.openConnection().getInputStream()), except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.

See the documentation for java.net.URLConnection.

You might find you can avoid handling encoded responses by sending a blank Accept-Encoding header. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html: "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding." That seems to be what is happening here.
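As a rough sketch of that idea, the request header can be set on the URLConnection before the body is read. This version sends the explicit value identity instead of an empty value; per the same RFC section, an empty Accept-Encoding value means only the identity encoding is acceptable, so the intent is the same. The class and method names are hypothetical, and a server is still free to ignore the header, so keeping the gzip check as a fallback is reasonable.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper (not from the original post): asks the server for
// an uncompressed response by declaring that only "identity" is accepted.
final class IdentityRequest {

    // openConnection() only creates the connection object; the request is
    // sent later, when getInputStream() (or connect()) is called.
    static URLConnection openUncompressed(URL url) throws IOException {
        URLConnection cxn = url.openConnection();
        cxn.setRequestProperty("Accept-Encoding", "identity");
        return cxn;
    }
}
```

The parser would then be fed with builder.build(IdentityRequest.openUncompressed(aUrl).getInputStream()).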

