XMLStreamReader: getting the character offset for XML read from a file

Question

The XMLStreamReader -> Location has a method called getCharacterOffset().

Unfortunately, the Javadoc indicates that this method is ambiguously named: it can also return a byte offset (and this appears to be true in practice). Unhelpfully, this seems to happen when reading from files, for instance.

The Javadoc states:

Return the byte or character offset into the input source this location is pointing to. If the input source is a file or a byte stream then this is the byte offset into that stream, but if the input source is a character media then the offset is the character offset. (emphasis added)

I really need the character offset, and I'm pretty sure I'm being given the byte offset instead.

The (UTF-8 encoded) XML is contained in a (partially corrupt, 1 GB) file. [Hence the need to use a lower-level API that doesn't complain about the lack of well-formedness until it really has no choice.]

Question

What does the Javadoc mean when it says '...input source is a character media...'? How can I force it to treat my input file as 'character media', so that I get an accurate character offset rather than a byte offset?

Extra blah blah:

[I'm pretty sure this is what is going on. When I split the file apart (using certain known high-level tags), I get a few characters missing or extra, in a non-accumulating way. I'm putting the difference down to a few multi-byte characters throwing off the counter. Also, when I copy pieces of the file (using 'head'/'tail' in PowerShell, for instance), the tool appears to correctly recognize (or assume) UTF-8 and does a good conversion to UTF-16, as far as I can see.]

Answer 1

The offset is in units of the underlying Source.

The XMLStreamReader only knows how many units it has read from the Source, so the offset is calculated in those units.

A Stream works in units of byte, and therefore you end up with a byte offset.

A Reader works in units of char, and therefore you end up with an offset in char.
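To make the difference between the two units concrete, here is a minimal sketch that measures the position of the same marker once in Java chars and once in UTF-8 bytes. The class name, method names, and sample string are illustrative assumptions, not part of the original post. Note that a Java char is a UTF-16 code unit, so characters outside the Basic Multilingual Plane count as two chars.

```java
import java.nio.charset.StandardCharsets;

public class OffsetUnits {
    // Position of `marker` counted in Java chars (UTF-16 code units), or -1 if absent.
    public static int charOffset(String text, String marker) {
        return text.indexOf(marker);
    }

    // Position of `marker` counted in UTF-8 bytes: encode the prefix and measure it.
    public static int utf8ByteOffset(String text, String marker) {
        int idx = text.indexOf(marker);
        if (idx < 0) {
            return -1;
        }
        return text.substring(0, idx).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+00E9 takes 2 bytes in UTF-8 and U+20AC takes 3, but each is a single char.
        String xml = "<a>\u00e9\u20ac</a><b/>";
        System.out.println(charOffset(xml, "<b/>"));     // prints 9
        System.out.println(utf8ByteOffset(xml, "<b/>")); // prints 12
    }
}
```

The gap between the two numbers depends only on the multi-byte characters that precede the position being measured.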

The docs for StreamSource are more explicit about what the term "character media" means.

Maybe try something like:

final Source source = new StreamSource(new InputStreamReader(new FileInputStream(new File("my.xml")), "UTF-8"));
final XMLStreamReader xmlReader = XMLInputFactory.newFactory().createXMLStreamReader(source);

Answer 2

XMLInputFactory.createXMLStreamReader(java.io.InputStream) takes a byte stream.

XMLInputFactory.createXMLStreamReader(java.io.Reader) takes a character stream.
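As a rough sketch of the Reader-based route, the following wraps the file in an InputStreamReader so the parser is fed chars rather than bytes, then records getCharacterOffset() at each start tag. The class name, method name, and the "name@offset" output format are assumptions made for illustration. Where exactly the reported offset falls relative to the tag may differ between StAX implementations, so it is worth checking against a small known file first.

```java
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ReaderOffsets {
    // Collects "localName@offset" for every start tag read from a character stream.
    // Parsing stops at the first well-formedness error; earlier results are kept.
    public static List<String> elementOffsets(Reader in) {
        List<String> out = new ArrayList<>();
        try {
            XMLStreamReader xml = XMLInputFactory.newFactory().createXMLStreamReader(in);
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT) {
                    int offset = xml.getLocation().getCharacterOffset();
                    out.add(xml.getLocalName() + "@" + offset);
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            // The input stopped being well-formed; report it and return what we have.
            System.err.println("Parsing stopped: " + e.getMessage());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the XML file; decoding as UTF-8 is assumed here.
        try (Reader in = new InputStreamReader(
                new FileInputStream(args[0]), StandardCharsets.UTF_8)) {
            for (String entry : elementOffsets(in)) {
                System.out.println(entry);
            }
        }
    }
}
```

Catching XMLStreamException inside the loop method is a design choice aimed at partially corrupt input: the offsets gathered before the failure are still returned.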

2 answers

Answer 1: 3, accepted, 2013-04-12 14:49:50

Answer 2: 1, 2013-04-12 14:52:38
