简体   繁体   中英

UTF-16LE encoding and xerces2 Java

I went through a few posts, like FileReader reads the file as a character stream and can be treated as whitespace if the document is handed as a stream of characters where the answers say the input source is actually a char stream, not a byte stream.

However, the suggested solution from 1 does not seem to apply to UTF-16LE. Although I use this code:

    try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
      DOMParser parser = new org.apache.xerces.parsers.DOMParser();
      parser.parse(new InputSource(is));
      return parser.getDocument();
    } catch (final SAXParseException saxEx) {
      LOG.debug("Unable to open [{}}] as InputSource.", absolutePath, saxEx);
    }

I still get org.xml.sax.SAXParseException: Content is not allowed in prolog. .

I looked at Files.newInputStream, and it indeed uses a ChannelInputStream which will hand over bytes, not chars. I also tried to set the Encoding of the InputSource object, but with no luck. I also checked that there are not extra chars (except the BOM) before the <?xml part.

I also want to mention that this code works just fine with UTF-8.

// Edit: I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), same results.

// Edit 2: Tried using a buffered reader. Same results: Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'

Thanks in advance.

To get a bit farther some info gathering:

byte[] bytes = Files.readAllBytes(filename.toPath);
String xml = new String(bytes, StandardCharsets.UTF_16LE);
if (xml.startsWith("\uFEFF")) {
    LOG.info("Has BOM and is evidently UTF_16LE");
    xml = xml.substring(1);
}
if (!xml.contains("<?xml")) {
    LOG.info("Has no XML declaration");
}
String declaredEncoding = xml.replaceFirst("<?xml[^>]*encoding=[\"']([^\"']+)[\"']", "$1");
if (declaredEncoding == xml) {
    declaredEncoding = "UTF-8";
}
LOG.info("Declared as " + declaredEncoding);

try (final InputStream is = new ByteArrayInputStream(xml.getBytes(declaredEncoding))) {
  DOMParser parser = new org.apache.xerces.parsers.DOMParser();
  parser.parse(new InputSource(is));
  return parser.getDocument();
} catch (final SAXParseException saxEx) {
  LOG.debug("Unable to open [{}}] as InputSource.", absolutePath, saxEx);
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM