
XMLStreamReader: getting the character offset when reading XML from a file

The Location returned by XMLStreamReader.getLocation() has a method called getCharacterOffset().

Unfortunately the Javadoc indicates that this method is ambiguously named: it can also return a byte offset, and this appears to be true in practice. Unhelpfully, this seems to happen when reading from files.

The Javadoc states:

Return the byte or character offset into the input source this location is pointing to. If the input source is a file or a byte stream then this is the byte offset into that stream, but if the input source is a character media then the offset is the character offset. (emphasis added)

I really need the character offset, and I'm pretty sure I'm being given the byte offset instead.

The (UTF-8 encoded) XML is contained in a partially corrupt 1 GB file. [Hence the need for a lower-level API that doesn't complain about the lack of well-formedness until it really has no choice.]

Question

What does the Javadoc mean when it says '...input source is a character media...'? How can I force it to treat my input file as 'character media', so that I get an accurate character offset rather than a byte offset?

Extra blah blah:

[I'm pretty sure this is what is going on: when I split the file apart (using certain known high-level tags) I get a few characters missing or extra, in a non-accumulating way. I'm putting the difference down to a few multi-byte characters throwing off the counter. Also, when I copy sections (using 'head'/'tail' in PowerShell, for instance), the tool appears to correctly recognize (or assume) UTF-8 and does a good conversion to UTF-16 as far as I can see.]

The offset is in units of the underlying Source.

The XMLStreamReader only knows how many units it has read from the Source, so the offset is calculated in those units.

An InputStream works in units of byte, and therefore you end up with a byte offset.

A Reader works in units of char, and therefore you end up with an offset in char.
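To illustrate why the two units drift apart, here is a minimal sketch that counts the same XML fragment both ways. The class name OffsetUnits and its methods are hypothetical helpers made up for this example, not part of any API; the only assumption is that the text is UTF-8 encoded, as in the question.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name and methods are illustrative only.
class OffsetUnits {
    // Number of Java chars (UTF-16 code units), the unit a Reader counts in.
    static int charCount(String text) {
        return text.length();
    }

    // Number of bytes in the UTF-8 encoding, the unit an InputStream counts in.
    static int utf8ByteCount(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+00E9 takes 2 bytes and U+20AC takes 3 bytes in UTF-8,
        // but each is a single Java char.
        String xml = "<a>\u00e9\u20ac</a>";
        System.out.println("chars: " + charCount(xml));
        System.out.println("bytes: " + utf8ByteCount(xml));
    }
}
```

Every multi-byte character before a given position pushes the byte offset further ahead of the char offset, which would match the "few characters missing or extra" symptom described in the question.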

The docs for StreamSource are more explicit about what the term "character media" means.

Maybe try something like:

final Source source = new StreamSource(new InputStreamReader(new FileInputStream(new File("my.xml")), "UTF-8"));
final XMLStreamReader xmlReader = XMLInputFactory.newFactory().createXMLStreamReader(source);

XMLInputFactory.createXMLStreamReader(java.io.InputStream) reads from a byte stream, so the offset is in bytes.

XMLInputFactory.createXMLStreamReader(java.io.Reader) reads from a character stream, so the offset is in chars.
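As a rough sketch of the Reader-based overload in use, the following decodes the file as UTF-8 before handing it to the parser. The class CharOffsetDemo, its method names, and the sample XML are assumptions made up for this example. The value that getCharacterOffset() reports may vary between StAX implementations, so the code prints it rather than relying on a particular number.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; names are illustrative only.
class CharOffsetDemo {
    // Wraps the file's bytes in a Reader so the parser sees chars, not bytes.
    static Reader openUtf8(Path file) throws IOException {
        return new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8);
    }

    // Returns the local name of the first start element and prints the
    // offset reported at that point; returns null if there is no element.
    static String firstElementName(Reader in) {
        try {
            XMLStreamReader xml = XMLInputFactory.newFactory().createXMLStreamReader(in);
            try {
                while (xml.hasNext()) {
                    if (xml.next() == XMLStreamConstants.START_ELEMENT) {
                        int offset = xml.getLocation().getCharacterOffset();
                        System.out.println(xml.getLocalName() + " @ " + offset);
                        return xml.getLocalName();
                    }
                }
                return null;
            } finally {
                xml.close(); // XMLStreamReader is not AutoCloseable
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Pass a file path to read it through the UTF-8 Reader.
            try (Reader in = openUtf8(Path.of(args[0]))) {
                firstElementName(in);
            }
        } else {
            firstElementName(new StringReader("<root><item>\u00e9</item></root>"));
        }
    }
}
```

Swapping openUtf8 for a plain InputStream and the InputStream overload would be the byte-counting variant described above.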

