JAXB & UTF-8 Unmarshal exception “Invalid byte 2 of 2-byte UTF-8 sequence”

I've read a few SO answers saying that JAXB has a bug, which it blames on the nature of XML, that stops it from working with UTF-8. My question is: what is the workaround? Users may copy and paste Unicode characters into a data field, and I need to preserve, marshal, unmarshal, and redisplay them elsewhere.

(update) More Context:

    Candidate c = new Candidate();
    c.addSubstitution("3 4ths", "\u00BE");
    c.addSubstitution("n with tilde", "\u00F1");
    c.addSubstitution("schwa", "\u018F");
    c.addSubstitution("Sigma", "\u03A3");
    c.addSubstitution("Cyrillic Th", "\u040B");     
    jc = JAXBContext.newInstance(Candidate.class);
    Marshaller marshaller = jc.createMarshaller();
    marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
    marshaller.setProperty(Marshaller.JAXB_ENCODING, "UTF-8");
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    marshaller.marshal(c, os);
    String xml = os.toString();
    System.out.println(xml);    
    jc = JAXBContext.newInstance(Candidate.class);
    Unmarshaller jaxb = jc.createUnmarshaller();
    ByteArrayInputStream is = new ByteArrayInputStream(xml.getBytes());
    Candidate newCandidate = (Candidate) jaxb.unmarshal(is);
    for (Substitution s : newCandidate.getSubstitutions()) {
        System.out.println(s.getSubstitutionName() + "='" + s.getSubstitutionValue() + "'");
    }

Here's a small test I threw together. The exact characters I get are not entirely under my control: users may paste an n with tilde, or anything else, into the field.

This is the problem in your test code:

ByteArrayInputStream is = new ByteArrayInputStream(xml.getBytes());

You're using the platform default encoding to convert the string to a byte array. Don't do that. You've specified that you're going to use UTF-8, so you must do so when you create the byte array:

ByteArrayInputStream is = new ByteArrayInputStream(xml.getBytes("UTF-8"));
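
To see why the charset matters, here is a minimal sketch that encodes the question's "n with tilde" twice. The class name `EncodingMismatchDemo` and the `hex` helper are made up for illustration, and ISO-8859-1 merely stands in for whatever non-UTF-8 default a given platform might have:

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Hypothetical helper: renders bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nTilde = "\u00F1";

        // What the XML declaration (encoding="UTF-8") promises the parser.
        byte[] utf8 = nTilde.getBytes(StandardCharsets.UTF_8);

        // Assumption: stands in for a non-UTF-8 platform default encoding.
        byte[] latin1 = nTilde.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8:      " + hex(utf8));   // C3 B1
        System.out.println("ISO-8859-1: " + hex(latin1)); // F1
    }
}
```

The same character becomes different bytes under the two encodings, so a parser that has been told to expect UTF-8 can run into byte sequences that are not valid UTF-8 when the array was produced with some other charset.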

Likewise, don't use ByteArrayOutputStream.toString(), which again uses the platform default encoding. Indeed, you don't need to convert the output to a string at all:

ByteArrayOutputStream os = new ByteArrayOutputStream();
marshaller.marshal(c, os);
byte[] xml = os.toByteArray();
jc = JAXBContext.newInstance(Candidate.class);
Unmarshaller jaxb = jc.createUnmarshaller();
ByteArrayInputStream is = new ByteArrayInputStream(xml);
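
If the XML is still wanted as a String, for example to print it as the question's test does, one option is to name the charset on both conversions. The sketch below is only an illustration: it uses the JDK's built-in DOM parser as a stand-in for the JAXB unmarshaller so that it runs without extra dependencies, and the class name `Utf8RoundTripDemo`, the `roundTrip` method and the `<value>` element are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class Utf8RoundTripDemo {
    // Hypothetical helper; 'text' must not contain '<' or '&' in this sketch.
    static String roundTrip(String text) throws Exception {
        String source = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><value>"
                + text + "</value>";

        // Stand-in for marshaller.marshal(c, os): UTF-8 bytes land in the stream.
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        os.write(source.getBytes(StandardCharsets.UTF_8));

        // Decode with an explicit charset instead of the no-argument toString().
        String xml = os.toString("UTF-8");

        // Encode with the same explicit charset before handing bytes to a parser.
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        // Stand-in for the JAXB unmarshaller: the JDK's DOM parser.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String original = "\u00BE \u00F1 \u018F \u03A3 \u040B";
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

Either way, the bytes handed to the parser are UTF-8, matching the encoding declared in the XML.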

This should have no problems with the characters you're using. It will still have problems with characters which can't be represented in XML 1.0 (characters below U+0020 other than \r, \n and \t), but that's all.
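
Since pasted user input could contain such control characters, one possible precaution is to drop them before marshalling. The following is a rough sketch rather than anything JAXB provides: `XmlCharFilter`, `isValidXml10` and `stripInvalidXml10` are hypothetical names, and the ranges are taken from the `Char` production of the XML 1.0 specification, which also excludes unpaired surrogates, U+FFFE and U+FFFF.

```java
class XmlCharFilter {
    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: returns the input without code points XML 1.0 cannot hold.
    static String stripInvalidXml10(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints()
                .filter(XmlCharFilter::isValidXml10)
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0001 is dropped; the tab and the n with tilde are kept.
        String cleaned = stripInvalidXml10("a\u0001b\tc\u00F1");
        System.out.println(cleaned.equals("ab\tc\u00F1"));
    }
}
```

Whether silently dropping characters is acceptable depends on the application; rejecting the input or replacing the character with a placeholder are alternatives.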
