How to parse an XML file containing a BOM?
I want to parse an XML file from a URL using JDOM. But when trying this:
SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);
I get this exception:
Invalid byte 1 of 1-byte UTF-8 sequence.
I thought this might be a BOM issue, so I checked the source and saw a BOM at the beginning of the file. I tried reading from the URL using aUrl.openStream() and removing the BOM with Commons IO BOMInputStream. But to my surprise, it didn't detect any BOM. I then tried reading from the stream, writing to a local file, and parsing the local file. I set all the encodings for InputStreamReader and OutputStreamWriter to UTF-8, but when I opened the file it was full of garbage characters.
I thought the problem was with the encoding of the source URL. But when I open the URL in a browser, save the XML to a file, and read that file through the process I described above, everything works fine.
I would appreciate any help on the possible cause of this issue.
That HTTP server is sending the content in GZIPped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. For example:
builder.build(new GZIPInputStream(aUrl.openStream()));
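To illustrate why a BOM check on the raw stream finds nothing, here is a minimal, self-contained sketch. It does not touch the network: it gzips a small XML document that starts with a UTF-8 BOM and compares the leading bytes before and after decompression. The class name GzipPeek and its helper methods are made up for this illustration and are not part of JDOM or Commons IO.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
class GzipPeek {
    // Compress a byte array the way a gzip-encoding server would.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    // Decompress a gzip-compressed byte array.
    static byte[] gunzip(byte[] packed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    // Gzip data starts with the magic bytes 0x1f 0x8b.
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2
            && (data[0] & 0xff) == 0x1f
            && (data[1] & 0xff) == 0x8b;
    }

    // A UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF.
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xff) == 0xEF
            && (data[1] & 0xff) == 0xBB
            && (data[2] & 0xff) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        byte[] onTheWire = gzip(xml);
        System.out.println("raw bytes look gzipped: " + looksGzipped(onTheWire));
        System.out.println("raw bytes start with BOM: " + startsWithUtf8Bom(onTheWire));
        System.out.println("decompressed bytes start with BOM: "
            + startsWithUtf8Bom(gunzip(onTheWire)));
    }
}
```

The compressed bytes begin with the gzip magic number rather than the BOM, which would explain both the failed BOM detection and the garbage characters when the bytes are decoded as UTF-8 text.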
Edited to add, based on the follow-up comment: If you don't know in advance whether the URL will be GZIPped, you can write something like this:
// Needs: java.io.IOException, java.io.InputStream, java.net.URL,
//        java.net.URLConnection, java.util.zip.GZIPInputStream
private InputStream openStream(final URL url) throws IOException
{
    final URLConnection cxn = url.openConnection();
    // null means the server sent no Content-Encoding header
    final String contentEncoding = cxn.getContentEncoding();
    if(contentEncoding == null)
        return cxn.getInputStream();
    else if(contentEncoding.equalsIgnoreCase("gzip")
            || contentEncoding.equalsIgnoreCase("x-gzip"))
        return new GZIPInputStream(cxn.getInputStream());
    else
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
}
(warning: not tested) and then use:
builder.build(openStream(aUrl));
This is basically equivalent to the above (aUrl.openStream() is explicitly documented to be a shorthand for aUrl.openConnection().getInputStream()), except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.
See the documentation for java.net.URLConnection.
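As an alternative to trusting the Content-Encoding header, one could sniff the first two bytes of the stream for the gzip magic number. This is a different technique from the header check described in this answer, shown only as a rough sketch; the class GzipSniffer and its methods are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
class GzipSniffer {
    // Wrap the stream in a GZIPInputStream only if it starts with 0x1f 0x8b.
    static InputStream maybeGunzip(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(2);            // remember the start of the stream
        int b0 = buffered.read();    // 0..255, or -1 at end of stream
        int b1 = buffered.read();
        buffered.reset();            // rewind so no bytes are lost
        boolean gzipped = (b0 == 0x1f && b1 == 0x8b);
        return gzipped ? new GZIPInputStream(buffered) : buffered;
    }

    // Read a stream to the end into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Compress a byte array, used here only to build sample input.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<root/>".getBytes("UTF-8");
        byte[] plain = readAll(maybeGunzip(new ByteArrayInputStream(xml)));
        byte[] unpacked = readAll(maybeGunzip(new ByteArrayInputStream(gzip(xml))));
        System.out.println(new String(plain, "UTF-8"));
        System.out.println(new String(unpacked, "UTF-8"));
    }
}
```

With a real URL, the result of maybeGunzip(aUrl.openStream()) could be passed to builder.build(...). Sniffing handles gzip only, so it would not cover other content codings a server might apply.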
You might find you can avoid handling encoded responses by sending a blank Accept-Encoding header. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html: "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding." That seems to be occurring here.
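A minimal sketch of that idea, assuming the server honors the header: set a blank Accept-Encoding request property on the URLConnection before reading from it. The class PlainRequest and the method prepare are hypothetical names, and http://example.com/ is a placeholder URL.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper class, for illustration only.
class PlainRequest {
    // Prepare a connection that asks for an unencoded response.
    static URLConnection prepare(URL url) throws IOException {
        // openConnection() only creates the connection object;
        // nothing is sent until the stream is requested.
        URLConnection cxn = url.openConnection();
        // Blank value, as suggested above; whether the server
        // respects it is an assumption.
        cxn.setRequestProperty("Accept-Encoding", "");
        return cxn;
    }

    public static void main(String[] args) throws IOException {
        URLConnection cxn = prepare(new URL("http://example.com/")); // placeholder URL
        System.out.println("Accept-Encoding: ["
            + cxn.getRequestProperty("Accept-Encoding") + "]");
    }
}
```

The stream from prepare(aUrl).getInputStream() could then be passed to builder.build(...). RFC 2616 section 14.3 says an empty Accept-Encoding value means only the "identity" encoding is acceptable, so setting the value to "identity" explicitly should express the same preference.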