
Handling inputs with unsupported and/or multiple charsets in Java

I am writing a Java SE 7 app to ingest all sorts of text-based inputs, and I am concerned about running into character sets/encodings that the JRE doesn't support (for instance, this app will run on a Linux box but will ingest files generated on every major OS).

For one, is there a way to catch an IOException (or similar) if the InputStreamReader encounters an unsupported charset/encoding?

And what about inputs that contain multiple encodings? Say we have 4 different types of inputs:

  • Raw java.lang.Strings
  • Plaintext (.txt) files
  • Word (.docx) files
  • PDF files

What if we're reading one of these inputs and we start encountering multiple (but supported) character encodings? Does the JRE natively handle this, or do I have to have multiple readers, each configured with its own charset/encoding?

In such a case, could I "normalize" the streaming inputs to a single, standardized (UTF-8 most likely) set/encoding? Thanks in advance.

To answer your first question: you can create a CharsetDecoder and specify what should happen when it encounters malformed input or unmappable characters.

// myCustomErrorAction is a java.nio.charset.CodingErrorAction: REPORT, REPLACE, or IGNORE
CharsetDecoder charsetDecoder = Charset.forName("UTF-8").newDecoder();
charsetDecoder.onMalformedInput(myCustomErrorAction);
charsetDecoder.onUnmappableCharacter(myCustomErrorAction);
Reader inputReader = new InputStreamReader(inputStream, charsetDecoder);
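To make the "catch an IOException" part concrete, here is a minimal sketch that assumes you want decoding to fail loudly, so it uses CodingErrorAction.REPORT for both cases. The class name StrictDecodeExample and the helper readStrictUtf8 are made up for illustration. With REPORT, the reader throws a CharacterCodingException, which is a subclass of IOException, as soon as it hits bytes that are not valid in the chosen charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeExample {

    // Hypothetical helper: reads the whole stream as UTF-8 and throws
    // CharacterCodingException (an IOException) on the first malformed
    // or unmappable byte sequence instead of silently replacing it.
    static String readStrictUtf8(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(readStrictUtf8(new ByteArrayInputStream(valid)));

        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] invalid = {(byte) 0xC3, (byte) 0x28};
        try {
            readStrictUtf8(new ByteArrayInputStream(invalid));
        } catch (CharacterCodingException e) {
            System.out.println("Input is not valid UTF-8: " + e);
        }
    }
}
```

If you would rather keep going past bad bytes, CodingErrorAction.REPLACE substitutes the decoder's replacement character and IGNORE drops the offending bytes; neither of those throws.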

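As for catching the case where an entire charset is not supported, it would look something like this (note that Charset.isSupported itself throws IllegalCharsetNameException if the name is syntactically illegal):

if (Charset.isSupported(encodingSpecified)) {
    // Normal case: safe to call Charset.forName(encodingSpecified)
} else {
    // Error case: this JVM has no implementation of the named charset
}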
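An alternative to checking first is to call Charset.forName directly and catch the two unchecked exceptions it can throw. The sketch below assumes you want to fall back to a default charset instead of aborting; the class CharsetLookupExample and the method charsetOrFallback are invented names, and whether a silent fallback is appropriate depends on your application.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

class CharsetLookupExample {

    // Hypothetical helper: returns the named charset, or the fallback if the
    // name is null, syntactically illegal, or not supported by this JVM.
    static Charset charsetOrFallback(String name, Charset fallback) {
        if (name == null) {
            return fallback; // Charset.forName(null) would throw IllegalArgumentException
        }
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println(charsetOrFallback("ISO-8859-1", utf8));      // supported
        System.out.println(charsetOrFallback("no-such-charset", utf8)); // legal name, unsupported
        System.out.println(charsetOrFallback("bad name!", utf8));       // illegal name
    }
}
```

In an ingest pipeline you may prefer to log or reject the input in the catch block so that a wrong guess does not quietly produce garbled text.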

I'm not sure about multiple encodings, however. It would be extremely unusual for a single binary stream to contain multiple encodings, because the stream would need some custom way of signalling each encoding change. You would have to scan the stream for that indicator and, when you found it, decode the data that follows with the new encoding. Do that scanning at the byte level: an InputStreamReader may read ahead more bytes than it needs for the current read, so wrapping a second reader around the same stream partway through can silently skip data.

In all cases, once you go from a stream of bytes to a stream of characters in Java, those characters are held in memory as char values (UTF-16 code units) regardless of the source encoding, so there is nothing to normalize unless you are writing the data back out somewhere. If you are going to save the data to a file later, I would highly recommend picking one encoding (UTF-8 is a common choice) and using it for everything you store.
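If you do need to write the data back out, "normalizing" amounts to reading with the source charset and writing with the target one. This is a rough sketch that assumes the source charset is already known; TranscodeExample and transcodeToUtf8 are illustrative names, and the caller remains responsible for closing both streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeExample {

    // Hypothetical helper: decodes bytes from 'in' using 'source' and
    // re-encodes the resulting characters to 'out' as UTF-8.
    // Neither stream is closed here; the caller owns them.
    static void transcodeToUtf8(InputStream in, Charset source, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, source);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush(); // push any buffered characters through the encoder
    }

    public static void main(String[] args) throws IOException {
        // "caf\u00e9" encoded as ISO-8859-1: the accented letter is the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcodeToUtf8(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1, out);
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Note that the InputStreamReader constructor taking a Charset replaces malformed input rather than reporting it, so pass a CharsetDecoder configured with CodingErrorAction.REPORT if bad bytes should be treated as an error.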


 